I Introduction
Traditionally, computers are operated by either a keyboard and a mouse or touch. They provide feedback to the user primarily via visual cues on a display. This human-computer interaction model can be unintuitive to a human user at first, but it allows users to express their intent clearly, as long as their goal is supported and they are equipped with sufficient knowledge to operate the machine. A spoken dialogue system (SDS) aims to make human-computer interaction more intuitive by equipping computers with the ability to translate between human and computer language, thereby relieving humans of this burden. More specifically, the objective of an SDS is to help a human user achieve their goal in a specific domain (e.g. hotel booking), using speech as the form of communication. Recent advances in artificial intelligence (AI) and reinforcement learning (RL) have established the necessary technology to build the first generation of commercial SDSs deployable as regular household items. Examples of such systems are Amazon’s Alexa, Google’s Home and Apple’s Siri. While initially built as voice-command systems, over the years these systems have become capable of sustaining dialogues that can span a few turns.
Spoken dialogue systems are complex as they have to solve many challenging problems at once, under significant uncertainty. They have to recognise spoken language, decode the meaning of natural language, understand the user’s goal while keeping track of the history of a conversation, determine what information to convey to the user, convert that information into natural language, and synthesise the sentences into speech that sounds natural. This work focuses on one particular step in this pipeline: devising a policy that determines the information to convey to the user, given our belief of their goal.
This policy has traditionally been planned out by hand using flowcharts. This was a manual and inflexible process with many drawbacks that ultimately led to systems that were unable to converse intelligently. To overcome this, the policy optimisation problem has been formulated as a reinforcement learning problem [1, 2, 3]. In this formulation, the computer takes actions and receives rewards. An algorithm aims to learn a policy that maximises the rewards by learning to take the best actions based on the state of the dialogue. Since the number of possible states can be very large (potentially infinite), complex and universal function approximators such as neural networks (NNs) have been deployed as the policy [4, 5, 6]. In recent years there has been a trend to model text-to-text dialogues with a neural network and to treat the task as a sequence-to-sequence problem. Initial attempts to do this underestimate the fact that planning is needed and treat the problem in a purely supervised fashion [7]. More recently, RL has also been applied, yielding improvements [8, 9, 10, 11]. While we focus here on traditional modular approaches, everything that we describe is also applicable to end-to-end modelling.
Using NNs for policy optimisation is challenging for two reasons. First, there is often little training data available for an SDS, as the data often comes from real humans. The system should be able to train quickly in an online setting, while the training data is being gathered from users, so that the gathered data is useful. Neural networks often exhibit too much bias or high variance when the volume of training data is small, making it difficult to train them quickly in a stable way. Second, the success or failure of a dialogue may be the only information available to the system to train the policy on. Dialogue success depends crucially on most actions in the dialogue, making it difficult to determine which individual actions contributed to the success, or led to the failure, of a dialogue. This problem is exacerbated by the large size of the state space: the system will potentially never be in the same state twice.
We address the above problems in the following ways:

We analyse the ACER algorithm in detail, highlighting its theoretical advantages: low variance, and safe and efficient learning.

We test the algorithm on a dialogue task with delayed rewards and compare it with state-of-the-art methods on this task.

After confirming its superior performance on a small action space, we deploy the algorithm on an action space that is two orders of magnitude larger.

We confirm our findings in a human evaluation.
The rest of the paper is organised as follows. First, in Section II, we give a brief introduction to dialogue management and define the main concepts. Then, in Section III we review reinforcement learning. This is followed by an in-depth description of the ACER algorithm in Section IV. Then, in Section V we describe the architecture deployed to allow the application of ACER to a dialogue problem. The results of the extensive evaluation are given in Section VII. In Section VIII we give conclusions and future work directions.
II Dialogue management
The job of the dialogue manager is to take the user’s dialogue acts, a semantic representation of the input, and determine the appropriate response also in the format of a dialogue act [14, 15]. The function that chooses the appropriate response is called the policy. The role of dialogue management is twofold: tracking the dialogue state and optimising the policy.
II-A Belief tracking
We call the user’s overall goal for a dialogue the user goal, e.g. booking a particular flight or finding information about a restaurant. The user works towards this goal in every dialogue turn. In each dialogue turn, the short-term goal of the user is called the user intent. Examples of user intent are: confirm what the system said, inform the system on some criteria, and request more information on something.
The belief tracker is the memory unit of the SDS, with the aim to track the user goal, the user intent and the dialogue history. For the state to satisfy the Markov property, it can only depend on the previous state and the action taken. Therefore, the state needs to encode enough information about what happened in the dialogue previously to maintain the conversation. By tracking the dialogue history we ensure that the state satisfies the Markov property. The user intent is derived from the (noisy) dialogue act. To deal with the inherent uncertainty of the input, the dialogue is modelled as a partially observable Markov decision process (POMDP) [3]. The belief state $\mathbf{b}$ is a vector representing a probability distribution over the different goals, intents and histories that occur in the dialogue. The role of the belief tracker is to accurately estimate this probability distribution, and this is normally done using a version of a recurrent neural network [16].
II-B Policy optimisation
A policy is a probability distribution over possible system actions given the current belief state, and is commonly written as $\pi(a|\mathbf{b})$. Here, $a$ is the action and $\mathbf{b}$ is the output of the belief tracker, which is interpreted as a vector of probabilities. (Footnote 1: Normally, in dialogue management, the action space is discrete.)
In order to define the optimal policy, we need to introduce a utility function (reward) $r$ that describes how good taking action $a$ is in state $\mathbf{b}$. The reward for a complete dialogue depends on whether the user was successful in reaching their goal and on the length of the conversation, such that short successful dialogues are preferred. Thus, the last dialogue interaction gains a reward based on whether the dialogue was successful, and every other interaction loses a small constant reward, penalising the length of the dialogue. The task of policy optimisation is to maximise the expected cumulative reward in any state $\mathbf{b}$ when following policy $\pi$, by choosing the optimal action from the set of possible actions $\mathcal{A}$. Finding the optimal policy is computationally prohibitive even for very simple POMDPs. For the purpose of policy optimisation, we can view the POMDP as a continuous-space Markov decision process (MDP), where the states are the belief states [17]. This allows us to apply function approximation to solve the problem of policy optimisation. This is the approach we adopt in this work.
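The reward structure described above can be sketched in a few lines of Python. This is only an illustration: the function name `dialogue_rewards` and the numeric values of `success_reward` and `turn_penalty` are assumptions, since this section does not state the constants that were used.

```python
def dialogue_rewards(num_turns, success, success_reward=20.0, turn_penalty=1.0):
    """Return one reward per dialogue turn.

    Every turn except the last loses a small constant (turn_penalty); the last
    turn gains success_reward if the dialogue succeeded and nothing otherwise.
    Both constants are hypothetical defaults chosen for illustration.
    """
    if num_turns < 1:
        raise ValueError("a dialogue needs at least one turn")
    final_reward = success_reward if success else 0.0
    return [-turn_penalty] * (num_turns - 1) + [final_reward]


# A short successful dialogue scores higher than a long successful one.
short_total = sum(dialogue_rewards(3, success=True))
long_total = sum(dialogue_rewards(10, success=True))
```

With these placeholder constants, the total reward of a successful dialogue decreases by one for every additional turn, which is what makes short successful dialogues preferable.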
II-C Action spaces
System actions are the dialogue acts that the system can give as a response. The set of all system actions is called the action space or the master action space. Due to its large size, training a dialogue policy in this action space is difficult. Some algorithms do not converge to the optimal policy, converge very slowly, or, in rare cases, have prohibitive computational demands. (Footnote 2: Since the training has to be online, i.e. happening while user input is acquired, training is constrained in computation time to prevent the user from having to wait for the system to reply. However, the training step is rarely the bottleneck.)
To alleviate this problem, we use the summary action space, which contains a much smaller number of actions. If a policy is trained on the summary action space, the action selected by the policy needs to be converted to a master action. The conversion is a set of heuristics that attempts to find the optimal slots to inform on given the belief state.
Using the summary action space provides the clear benefit of a simpler dialogue policy optimisation task. On the other hand, the necessary heuristics to map to the master action space need to be manually constructed for each domain. This means that the belief state needs to be human interpretable. This limits the applicability of neural networks for belief tracking where the belief state is compactly represented as a hidden layer in the neural network.
The description of the summary and master actions that we consider is given in Appendix A.
II-D Execution mask
Not every system action is appropriate in every situation (belief state). For example, inform is not a valid action at the very beginning of the dialogue, when the system has not yet received any information on what kind of entity the user is looking for. An execution mask is constructed by the designer that ensures that only valid actions are selected by the policy: the probability of invalid actions is set to zero. The execution mask depends on the current belief state. Note that removing this mask inherently complicates the task of policy learning, as the policy then has to learn not to select inappropriate actions based on the belief state.
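A minimal sketch of how such a mask could be applied to the policy output is given below. The function name `apply_execution_mask` is hypothetical, and renormalising the remaining probabilities so that they sum to one is an assumption; the text only states that invalid actions receive zero probability.

```python
def apply_execution_mask(action_probs, valid):
    """Zero out invalid actions and renormalise the remaining probabilities.

    action_probs: policy output, one probability per system action.
    valid: booleans of the same length, produced by the designer's rules
           for the current belief state (assumed to be given).
    """
    masked = [p if ok else 0.0 for p, ok in zip(action_probs, valid)]
    total = sum(masked)
    if total <= 0.0:
        raise ValueError("the mask removed every action")
    return [p / total for p in masked]


# Example: the second action (say, an inform) is not valid in this belief state.
masked_probs = apply_execution_mask([0.5, 0.3, 0.2], [True, False, True])
```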
III Reinforcement learning
In RL, an agent interacts with the environment in discrete time steps. In each time step $t$, the agent observes the environment as a belief state vector $\mathbf{b}_t$ and chooses an action $a_t$ from the action space $\mathcal{A}$. After performing action $a_t$, the agent observes a reward $r_t$ produced by the environment.
The cumulative discounted return is the future value of an episode of interactions. For time step $t$, we calculate this as
$$R_t = \sum_{i \geq 0} \gamma^i r_{t+i}.$$
The discount factor $\gamma \in [0, 1]$ trades off the importance of immediate and future rewards. The goal of the agent is to find a policy that maximises the expected discounted cumulative return for every state. We define the value of a state-action pair under policy $\pi$ to be the Q-value function, the expectation of the return for belief state $\mathbf{b}_t$ and action $a_t$:
$$Q^\pi(\mathbf{b}_t, a_t) = \mathbb{E}_\pi\left[R_t \mid \mathbf{b}_t, a_t\right],$$
and the value of a state is the value function, which is the expected return conditioned only on the belief state $\mathbf{b}_t$:
$$V^\pi(\mathbf{b}_t) = \mathbb{E}_\pi\left[R_t \mid \mathbf{b}_t\right].$$
In both definitions, the expectation is taken over the states the environment could be in after performing the current action, and the future actions selected by policy $\pi$ [18].
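As a small worked example, the discounted return of every time step in a finished episode can be computed in a single backward pass, using the standard recursion $R_t = r_t + \gamma R_{t+1}$. The function name `discounted_returns` and the example rewards are illustrative assumptions.

```python
def discounted_returns(rewards, gamma):
    """Compute R_t = r_t + gamma * R_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0  # return after the terminal step is zero
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Two penalised turns followed by a final success reward (hypothetical values).
example_returns = discounted_returns([-1.0, -1.0, 20.0], gamma=0.5)
```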
As reinforcement learning scenarios become more challenging, the agent estimates value functions by trial and error, interacting with a simulated or real environment. These estimates are accurate only in the limit of infinite observations for each state-action pair, thus a requirement for the behaviour policy $\mu$ is to maintain exploration, i.e. to keep visiting all state-action pairs with non-zero probability. The behaviour policy is the policy used to generate the data during learning. For on-policy methods, the behaviour policy is the same as the learned policy $\pi$; in other words, we evaluate and improve the same policy that is used to make decisions. In contrast, off-policy methods evaluate and improve a policy different from the one used to generate the data, i.e. the behaviour policy $\mu$ and the learned policy $\pi$ can be different. The advantage of off-policy methods is that the optimal policy can be learned even while we are choosing suboptimal actions according to the behaviour policy.
Standard reinforcement learning algorithms require that the state space is discrete. Therefore, the belief state of a dialogue manager is often discretised to allow standard algorithms to be applied [19]. Alternatively, function approximation can be applied for $V$, $Q$ or $\pi$. In [20], linear function approximation was applied to value functions. As parametric function approximation can limit the optimality of the solution, GPSARSA [21] instead models the Q-function as a Gaussian process. The key here is the kernel function, which models the correlation between different states and allows uncertainty to be estimated. This is crucial for learning from a small number of samples. More recently, neural network function approximation was used to approximate the Q-value function, known as deep Q-learning (DQN), obtaining human-level performance across challenging video games [22]. The policy can also be modelled directly by deep networks, leading to the resurgence of actor-critic methods [23]. The actor improves its policy $\pi$ through interactions, directed by the critic (value function $V$).
III-A Neural networks in dialogue management
A number of deep learning algorithms were previously applied to dialogue management. It has been shown in [24] that DQN enables learning strategic agents with negotiation abilities operating on a high-dimensional state space. The performance of actor-critic models on task-oriented dialogue systems was analysed in [6]. These models can also be naturally bootstrapped with a small number of dialogues via supervised pre-training. The authors reported superior performance compared to GPSARSA in a noise-free environment. However, the GPSARSA used in that comparison did not utilise uncertainty estimates, which were previously found to be crucial for effective learning [21].
Uncertainty estimates can be incorporated into DQN using Bayes-by-Backprop [25]. Initial results show an improvement in learning efficiency compared to vanilla DQN.
A number of recent works investigated end-to-end training using gradient descent techniques [26, 27, 7], where belief tracking and policy optimisation are optimised jointly. While end-to-end modelling goes beyond the scope of this work, we note that the presented algorithm is also applicable in that setting.
IV ACER
This paper builds on recent breakthroughs in deep reinforcement learning (DRL) and applies them to the problem of dialogue management. In particular, we investigate recently proposed improvements to the actor-critic method [12]. The goal is a stable and sample-efficient learning algorithm that performs well on challenging policy optimisation tasks in the SDS domain. Recent advances in DRL apply several methods, including experience replay [28], truncated importance sampling with bias correction [12], the off-policy Retrace algorithm [13] and trust region policy optimisation [23], to various challenging problems. These methods were recently combined in the actor-critic with experience replay (ACER) algorithm and tested in gaming environments. The core of this paper is to investigate to what extent these advances are applicable to the dialogue policy optimisation task with a large action space. To this end, we explain ACER in detail and investigate the steps needed to apply it to an SDS. Unlike in games, where these methods have been previously applied, we investigate dialogues with large and uncertain belief states and very large action spaces. This necessitates function approximation in reinforcement learning, but previously examined methods in SDSs are data-inefficient, unstable or computationally too expensive. We investigate ACER as a means to overcome these limitations.
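IV-A Actor-critic with Experience Replay
In order to use experience replay in an actor-critic method, an off-policy version of the actor-critic method is needed. The objective is to find a policy $\pi$ that maximises the expected discounted return. This is equivalent to maximising the value of the initial state $\mathbf{b}_0$ with respect to the policy parameter vector $\omega$:
$$J(\omega) = V^{\pi_\omega}(\mathbf{b}_0).$$
Another way of expressing the same objective is to maximise the cumulative reward received from the average state [29]. For behaviour policy $\mu$, let the occupancy frequency be defined as:
$$d^\mu(\mathbf{b}) = \lim_{t \to \infty} P(\mathbf{b}_t = \mathbf{b} \mid \mathbf{b}_0, \mu).$$
According to the new definition of $J(\omega)$, $V^{\pi_\omega}$ is weighted by $d^\mu$ because $\mu$ was used to collect the experience:
$$J(\omega) = \sum_{\mathbf{b}} d^\mu(\mathbf{b}) \, V^{\pi_\omega}(\mathbf{b}),$$
where the policy $\pi_\omega$ that maximises $J(\omega)$ is the optimal policy.
The off-policy version of the Policy Gradient Theorem [30] is used to derive the gradients $\nabla_\omega J(\omega)$:
$$\nabla_\omega J(\omega) \approx \sum_{\mathbf{b}} d^\mu(\mathbf{b}) \sum_{a} \nabla_\omega \pi(a|\mathbf{b}) \, Q^\pi(\mathbf{b}, a). \tag{1}$$
The states are encountered in proportions according to $d^\mu$ just by sampling from the experience memory, so there is no need to estimate $d^\mu$ explicitly. Estimating $Q^\pi$, however, is more difficult: the off-policy interactions are gathered according to $\mu$, and we need the $Q$ function under a different policy, the current policy $\pi$.
To account for this, importance sampling (IS) weights [31] can be used. To achieve stable learning, we use an estimation method that achieves low variance by considering state-action pairs in isolation, applying only one IS weight for each.
Continuing from Equation (1), the approximation of the true gradient can be derived:
$$\nabla_\omega J(\omega) \approx \mathbb{E}_{\mathbf{b} \sim d^\mu}\left[\sum_a \mu(a|\mathbf{b}) \, \frac{\pi(a|\mathbf{b})}{\mu(a|\mathbf{b})} \, \frac{\nabla_\omega \pi(a|\mathbf{b})}{\pi(a|\mathbf{b})} \, Q^\pi(\mathbf{b}, a)\right] = \mathbb{E}_{\mathbf{b} \sim d^\mu, \, a \sim \mu}\left[\rho(a|\mathbf{b}) \, \nabla_\omega \log \pi(a|\mathbf{b}) \, Q^\pi(\mathbf{b}, a)\right],$$
where $\rho(a|\mathbf{b}) = \pi(a|\mathbf{b}) / \mu(a|\mathbf{b})$ are the IS weights. The advantage function $A^\pi(\mathbf{b}, a) = Q^\pi(\mathbf{b}, a) - V^\pi(\mathbf{b})$ is used in place of the $Q$ function for an unbiased estimate with a lower variance:
$$\nabla_\omega J(\omega) \approx \mathbb{E}_{\mathbf{b} \sim d^\mu, \, a \sim \mu}\left[\rho(a|\mathbf{b}) \, \nabla_\omega \log \pi(a|\mathbf{b}) \, A^\pi(\mathbf{b}, a)\right].$$
In the off-policy setting, the advantage function has to be approximated from the sampled interactions.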
IV-B Lambda returns
The unbiased Monte Carlo estimator of $Q^\pi$, which is based on the full return of the episode, results in high variance, due to the IS weight that has to be calculated for the entire episode. The temporal difference (TD) estimation
$$Q^\pi(\mathbf{b}_t, a_t) \approx r_t + \gamma V(\mathbf{b}_{t+1}) \tag{2}$$
only requires a single IS weight. However, this estimation is biased: the value function update of the current state is based on the current estimate of the value function for the next state. This leads to slow convergence or no convergence at all.
It is possible to combine both methods and create an estimator that trades off bias and variance according to a parameter $\lambda$. [29] estimate $Q^\pi$ as:
$$R^\lambda_t = r_t + (1 - \lambda) \gamma V(\mathbf{b}_{t+1}) + \lambda \gamma \rho_{t+1} R^\lambda_{t+1}.$$
The constant $\lambda \in [0, 1]$ controls the bias-variance trade-off: setting $\lambda$ to 0 results in the same estimation as in Equation (2), with a low variance but high bias. Conversely, setting $\lambda$ to 1 results in high variance, as many IS weights are multiplied together. This has the advantage of propagating the final reward further to the starting state, which reduces bias.
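The trade-off can be made concrete with a short sketch. It assumes the recursive lambda-return form $R^\lambda_t = r_t + (1 - \lambda)\gamma V(\mathbf{b}_{t+1}) + \lambda\gamma\rho_{t+1} R^\lambda_{t+1}$ commonly used with off-policy actor-critic methods; the function name, argument layout and example numbers are illustrative assumptions rather than the authors' implementation.

```python
def lambda_returns(rewards, next_values, next_rhos, gamma, lam):
    """Off-policy lambda-returns for one terminated episode.

    rewards[t]     : r_t
    next_values[t] : V(b_{t+1}), with 0.0 after the terminal turn
    next_rhos[t]   : IS weight rho_{t+1} = pi / mu of the next action
                     (ignored for the last turn, where no action follows)
    """
    n = len(rewards)
    out = [0.0] * n
    ret = 0.0  # R^lambda after the terminal step
    for t in reversed(range(n)):
        bootstrap = (1.0 - lam) * gamma * next_values[t]
        ret = rewards[t] + bootstrap + lam * gamma * next_rhos[t] * ret
        out[t] = ret
    return out


rewards = [-1.0, -1.0, 20.0]
next_values = [2.0, 4.0, 0.0]
rhos = [1.0, 1.0, 1.0]
td_like = lambda_returns(rewards, next_values, rhos, gamma=1.0, lam=0.0)
monte_carlo = lambda_returns(rewards, next_values, rhos, gamma=1.0, lam=1.0)
```

With `lam=0.0` each target only looks one step ahead and relies on the value estimate, while with `lam=1.0` and on-policy weights the final reward is propagated back to the first turn.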
IV-C Retrace
The Retrace algorithm [13] attempts to estimate the current $Q$ function from off-policy interactions in a safe and efficient way, with small variance. Throughout this discussion, we call a method safe if its estimate of $Q$ can be proven to converge to $Q^\pi$. The updated estimate of the $Q$ function, $Q^{ret}$, is computed based on state-action trajectories sampled from the replay memory:
$$Q^{ret}(\mathbf{b}_t, a_t) = Q(\mathbf{b}_t, a_t) + \mathbb{E}_\mu\left[\sum_{\tau \geq t} \gamma^{\tau - t} \left(\prod_{i=t+1}^{\tau} c_i\right) \left(r_\tau + \gamma V(\mathbf{b}_{\tau+1}) - Q(\mathbf{b}_\tau, a_\tau)\right)\right]. \tag{3}$$
The methods that stem from this framework differ only in their definition of the eligibility traces $c_i$.
This framework introduces changes to the actor-critic model. Instead of approximating $\pi$ and $V$ with NNs and estimating $Q$ in a closed-form equation to compute the update targets, both $\pi$ and $Q$ are estimated with NNs. $V$ is then computed from $Q$ and $\pi$:
$$V(\mathbf{b}) = \sum_{a} \pi(a|\mathbf{b}) \, Q(\mathbf{b}, a). \tag{4}$$
We focus on Retrace as proposed by [13], where $c_i = \lambda \min(1, \rho_i)$. Ideally, we need a method that is safe, has low variance and is as efficient as possible. Retrace solves this trade-off by setting the traces “dynamically”, based on the IS weights. In the near on-policy case, it is efficient, as the IS weights will be about 1, preventing the traces from vanishing. It has low variance because the IS weights are clipped at 1. It is also safe for any $\pi$ and $\mu$. The goal of this discussion is limited to conveying the intuition behind Retrace; a full proof of safety is available in [13].
IV-D Computational cost
Let us investigate the computational cost of deriving $Q^{ret}$ from $Q$ in a naïve way. For each episode sampled from the replay memory, and for each state-action pair, we need to visit the remaining part of the episode to calculate the expectation of the TD errors under $\mu$ according to Equation (3). This quadratic element of the computational cost can be reduced to a linear one by deriving $Q^{ret}$ in a recursive way. For an episode trajectory sampled from the replay memory, Equation (3) becomes:
$$Q^{ret}(\mathbf{b}_t, a_t) = r_t + \gamma V(\mathbf{b}_{t+1}) + \gamma c_{t+1} \left(Q^{ret}(\mathbf{b}_{t+1}, a_{t+1}) - Q(\mathbf{b}_{t+1}, a_{t+1})\right).$$
We will use this more computationally efficient, recursive formulation of $Q^{ret}$.
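The recursive formulation can be sketched as a single backward pass over one sampled episode. The sketch assumes the Retrace traces $c_t = \lambda \min(1, \rho_t)$ and the recursion $Q^{ret}_t = r_t + \gamma V(\mathbf{b}_{t+1}) + \gamma c_{t+1}(Q^{ret}_{t+1} - Q_{t+1})$, with all quantities after the terminal turn set to zero; the function and argument names are hypothetical.

```python
def retrace_targets(rewards, q_taken, values, rhos, gamma, lam=1.0):
    """Linear-time Retrace targets for one terminated episode.

    rewards[t] : r_t
    q_taken[t] : current estimate Q(b_t, a_t) of the action actually taken
    values[t]  : V(b_t) under the current policy
    rhos[t]    : IS weight pi(a_t|b_t) / mu(a_t|b_t)
    """
    n = len(rewards)
    targets = [0.0] * n
    next_value = 0.0       # V(b_{t+1}); zero after the terminal turn
    next_correction = 0.0  # c_{t+1} * (Qret_{t+1} - Q_{t+1})
    for t in reversed(range(n)):
        targets[t] = rewards[t] + gamma * (next_value + next_correction)
        trace = lam * min(1.0, rhos[t])  # truncated IS weight c_t
        next_correction = trace * (targets[t] - q_taken[t])
        next_value = values[t]
    return targets


rewards = [-1.0, -1.0, 20.0]
on_policy = retrace_targets(rewards, [0.0] * 3, [0.0] * 3, [1.0] * 3, gamma=1.0)
cut_traces = retrace_targets(rewards, [0.0] * 3, [5.0, 6.0, 7.0], [0.0] * 3, gamma=1.0)
```

Each turn is visited once, so the cost is linear in the episode length. When the IS weights are 1 the targets reduce to full returns, and when they are 0 the traces are cut and each target falls back to a one-step bootstrap.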
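IV-E Importance weight truncation with bias correction
Currently, we calculate the policy gradient as:
$$\nabla_\omega J(\omega) \approx \mathbb{E}\left[\rho(a|\mathbf{b}) \, \nabla_\omega \log \pi(a|\mathbf{b}) \, A(\mathbf{b}, a)\right], \tag{5}$$
where the expectation is taken over the replay memory, and $A(\mathbf{b}, a) = Q^{ret}(\mathbf{b}, a) - V(\mathbf{b})$. An issue with this approximation is that the IS weights are potentially unbounded, introducing significant variance. To solve this problem, we clip the IS weights from above by a constant $c$: $\bar{\rho} = \min(c, \rho)$. We can split the equation into two parts, one involving the truncated IS weight, and the other the residual. We also need to estimate the residual, otherwise we introduce bias in the gradient estimation. We call the residual the bias correction term.
$$\nabla_\omega J(\omega) \approx \mathbb{E}_{a \sim \mu}\left[\bar{\rho}(a|\mathbf{b}) \, \nabla_\omega \log \pi(a|\mathbf{b}) \, A(\mathbf{b}, a)\right] + \mathbb{E}_{a \sim \mu}\left[\left[\rho(a|\mathbf{b}) - c\right]_+ \nabla_\omega \log \pi(a|\mathbf{b}) \, A(\mathbf{b}, a)\right],$$
where $[x]_+ = \max(0, x)$. The weight of the bias correction term, $[\rho - c]_+$, can still be unboundedly large. This can be solved by sampling the action from the distribution $\pi$ rather than $\mu$ [12]:
$$\nabla_\omega J(\omega) \approx \mathbb{E}_{a \sim \mu}\left[\bar{\rho}(a|\mathbf{b}) \, \nabla_\omega \log \pi(a|\mathbf{b}) \, A(\mathbf{b}, a)\right] + \mathbb{E}_{a \sim \pi}\left[\left[\frac{\rho(a|\mathbf{b}) - c}{\rho(a|\mathbf{b})}\right]_+ \nabla_\omega \log \pi(a|\mathbf{b}) \, A(\mathbf{b}, a)\right]. \tag{6}$$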
There are two key advantages of this formulation:

The bias correction term ensures that the estimate of the gradient remains unbiased.

The bias correction term is only active when $\rho > c$, and otherwise the formulation is equivalent to Equation (5). When active, the bias correction weight falls between 0 and 1.
To apply this method, called the truncation with bias correction trick by [12], we have to overcome a problem with the advantage function estimation. Before, we estimated $A$ for belief-action pairs that we sampled from the replay memory, Equation (4). For the bias correction term, however, only the belief is sampled from the memory, and all the actions are considered and weighted by the current policy $\pi$. Due to the way $Q^{ret}$ is formulated, it learns from rewards, and only covers belief-action pairs that have been visited and sampled from the replay memory. Thus the $Q^{ret}$ estimation is not available for the bias correction term, so we use the output of the NN, $Q$, to estimate the advantage function for that term: $A(\mathbf{b}, a) = Q(\mathbf{b}, a) - V(\mathbf{b})$.
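The two weights involved in this trick can be illustrated for a discrete action space. The sketch assumes the truncated weight $\min(c, \rho)$ for the sampled action and the correction weight $[1 - c/\rho]_+$ for every action; the function name and the value $c = 10$ are illustrative assumptions, and $\mu$ is assumed to be strictly positive for all actions.

```python
def truncation_weights(pi_probs, mu_probs, taken_action, c=10.0):
    """Return the truncated IS weight of the sampled action and the
    bias-correction weight of every action in a discrete action space.

    pi_probs, mu_probs : current and behaviour policy for one belief state
    taken_action       : index of the action stored in the replay memory
    c                  : truncation constant (hypothetical default)
    """
    rhos = [p / m for p, m in zip(pi_probs, mu_probs)]
    truncated = min(c, rhos[taken_action])
    # [1 - c / rho]_+ is zero unless rho exceeds c, and never exceeds 1.
    correction = [max(0.0, 1.0 - c / rho) if rho > 0.0 else 0.0 for rho in rhos]
    return truncated, correction


# The current policy strongly prefers action 0, which mu rarely chose.
truncated, correction = truncation_weights([0.9, 0.1], [0.05, 0.95], taken_action=0)
```

In the gradient, the first term uses `truncated` together with the advantage based on the Retrace target for the sampled action, and the second term sums over all actions, each weighted by its probability under the current policy and its entry in `correction`.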
IV-F Trust Region Policy Optimisation
Typically, the step size parameter in gradient descent is calculated assuming that the policy parameter space is Euclidean. However, this has a major shortcoming: small changes in the parameter space can lead to erratic changes in the output policy [32, 33]. This could lead to unstable learning or a learning rate too small for quick convergence. This is solved in the Natural Actor Critic algorithm by considering the natural gradient [34].
Instead of computing the exact natural gradient, we can approximate it. For the natural gradient, the distance metric tensor is the Fisher information matrix:
$$F(\omega) = \mathbb{E}\left[\nabla_\omega \log \pi_\omega(a|\mathbf{b}) \, \nabla_\omega \log \pi_\omega(a|\mathbf{b})^\top\right].$$
It can be shown [35] that
$$KL\left[\pi_\omega \,\|\, \pi_{\omega + \Delta\omega}\right] \approx \frac{1}{2} \Delta\omega^\top F(\omega) \, \Delta\omega,$$
where KL is the Kullback-Leibler divergence. Thus, instead of directly restricting the learning step size with the natural gradient method, we can approximate the same method by restricting the Kullback-Leibler divergence between the current policy parametrised by $\omega$, and the updated policy parametrised by $\omega + \eta \Delta\omega$, for learning rate $\eta$. This method is called trust region policy optimisation (TRPO), introduced by [23]. Their method, however, relies on repeated computations of Fisher matrices for each update, which can be prohibitively expensive. [12] introduces an efficient TRPO method that we will adopt instead. Our description of the method largely follows theirs, with additional explanations and the necessary adaptation to our discrete action-space SDS domain.
To begin with, [12] proposes that the KL divergence to the updated policy should be measured not from the current policy, but from a separate average policy instead. This stabilises the algorithm by preventing it from gaining momentum in a specific direction. Instead, it is restricted to stay around a more stable average policy $\pi_{\omega_a}$. The average policy is parametrised with $\omega_a$, where $\omega_a$ represents a running average of all previous policy parameters. It is updated softly after each learning step as:
$$\omega_a \leftarrow \alpha \, \omega_a + (1 - \alpha) \, \omega.$$
Here, $\alpha$ is a hyperparameter that controls the amount of history to maintain in the average policy. A value close to zero makes the average policy forget the history very quickly, reducing the effect of calculating the distances from the average policy instead of the current one. A value close to one prevents the average policy from adjusting to the current policy, or slows this adjustment down.
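TRPO can be formulated as an optimisation problem, where we aim to find $z$ that minimises the L2-distance between $z$ and the vanilla gradient $g$ from Equation (6). This is a quadratic minimisation. In addition, our aim is for the divergence constraint to be formulated in a linear way, which will allow us to derive a closed-form solution. Since $z$ will be used for the parameter update, we have $\omega' = \omega + \eta z$, where $\omega'$ denotes the updated parameter vector. We can approximate the KL divergence after the policy update using a first-order Taylor expansion:
$$KL\left[\pi_{\omega_a} \,\|\, \pi_{\omega'}\right] \approx KL\left[\pi_{\omega_a} \,\|\, \pi_{\omega}\right] + \eta \, \nabla_\omega KL\left[\pi_{\omega_a} \,\|\, \pi_{\omega}\right]^\top z.$$
So the increase in KL divergence in this step is
$$\eta \, \nabla_\omega KL\left[\pi_{\omega_a} \,\|\, \pi_{\omega}\right]^\top z.$$
We can constrain this increase to be small by setting a bound $\delta$, such that
$$\nabla_\omega KL\left[\pi_{\omega_a} \,\|\, \pi_{\omega}\right]^\top z \leq \delta,$$
where the learning rate $\eta$ is left out, since it is a constant and can be incorporated into $\delta$. Letting $k = \nabla_\omega KL\left[\pi_{\omega_a} \,\|\, \pi_{\omega}\right]$, the optimisation problem with the linearised KL divergence constraint is [12]:
$$\min_z \; \frac{1}{2} \left\| g - z \right\|_2^2 \quad \text{subject to} \quad k^\top z \leq \delta.$$
Since the constraint is linear, the overall optimisation problem reduces to a simple quadratic programming problem. Thus, a closed-form solution can be derived using the KKT conditions [36]:
$$z^* = g - \max\left(0, \frac{k^\top g - \delta}{\left\| k \right\|_2^2}\right) k.$$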
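A short sketch of this closed-form step follows. It assumes the solution of the quadratic programme is the projection $z^* = g - \max(0, (k^\top g - \delta) / \|k\|_2^2) \, k$ used in the efficient TRPO variant of [12]; the function name and the plain-list vector representation are illustrative choices.

```python
def trust_region_update(g, k, delta):
    """Project the vanilla gradient g onto the half-space k^T z <= delta.

    g     : vanilla policy gradient
    k     : gradient of the KL divergence from the average policy
    delta : bound on the linearised KL increase
    """
    k_dot_g = sum(ki * gi for ki, gi in zip(k, g))
    k_norm_sq = sum(ki * ki for ki in k)
    if k_norm_sq == 0.0:
        return list(g)  # no KL change in any direction: nothing to constrain
    scale = max(0.0, (k_dot_g - delta) / k_norm_sq)
    return [gi - scale * ki for gi, ki in zip(g, k)]


# The constraint is violated (k.g = 3 > delta = 1), so g is pulled back along k.
z_active = trust_region_update([3.0, 0.0], [1.0, 0.0], delta=1.0)
# The constraint already holds, so the gradient is left unchanged.
z_inactive = trust_region_update([0.5, 2.0], [1.0, 0.0], delta=1.0)
```

The gradient is only modified when the constraint is violated, and then only along the direction $k$, which is why no Fisher matrix has to be computed.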
IV-G Summary of ACER
ACER is the result of combining all methods presented in this section. With on-policy exploration, it is a modified version of A2C. Both ACER and A2C use experience replay (ER) and sample from their memories to achieve high sample efficiency. The difference between them is that ACER additionally employs TRPO, and that it uses a $Q$-function estimator instead of a $V$-function estimator as the critic. When off-policy, it uses truncated importance sampling with bias correction [12] to reduce the variance of the IS weights without adding bias. The Retrace algorithm is used to compute the targets based on the observed rewards in a safe, efficient way, with low bias and variance.
This training algorithm is presented in pseudocode (Algorithm 2), and is called from the master ACER algorithm (Algorithm 1). It performs $\epsilon$-greedy exploration, i.e. it chooses the optimal action learned so far with probability $1 - \epsilon$, and a random action with probability $\epsilon$. A hyperparameter batch_size controls the number of dialogues considered for a training step, and a second hyperparameter controls the number of training steps for each new dialogue gathered. We will investigate the effect of various hyperparameters and how to set them in Section VII-E.
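The exploration rule can be sketched as follows, assuming the usual epsilon-greedy scheme in which a uniformly random action is taken with probability epsilon and the currently best action otherwise. The function name and the use of the policy output as the ranking criterion are assumptions for illustration.

```python
import random


def epsilon_greedy(action_probs, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best one.

    action_probs : policy output for the current belief state
    epsilon      : exploration rate in [0, 1]
    rng          : source of randomness (the random module or a random.Random)
    """
    if rng.random() < epsilon:
        return rng.randrange(len(action_probs))
    # Greedy choice: index of the highest-probability action.
    return max(range(len(action_probs)), key=action_probs.__getitem__)


greedy_choice = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0)
explored_choice = epsilon_greedy([0.1, 0.7, 0.2], epsilon=1.0, rng=random.Random(0))
```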
V ACER for dialogue modelling
In this section we detail the steps needed to apply ACER to a dialogue task.
V-A Learning in summary action space
Let us design the NNs for actor-critic for a dialogue management task. On top of the input of the belief state, we build two hidden layers, $h_1$ and $h_2$. The heads of the NN are the functions $\pi$ and $Q$. Both hidden layers $h_1$ and $h_2$ are shared between the predictors of $\pi$ and $Q$. Weight sharing can be beneficial as it reduces the number of parameters to train. Furthermore, in a dialogue system, we expect a strong positive correlation between $\pi$ and $Q$. The architecture is illustrated in Figure 1.
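A forward pass through such a shared network can be sketched with the standard library alone. The layer sizes (268 inputs, 130 and 50 hidden units, 15 actions) and the activations (ReLU for the hidden layers, softmax for the policy head, none for the Q head) are the ones reported for this architecture; the class name, the weight initialisation and the absence of any training code are simplifying assumptions.

```python
import math
import random


def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)  # hypothetical initialisation scheme
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class ActorCriticNet:
    """Two shared hidden layers with a policy head and a Q head."""

    def __init__(self, n_belief=268, n_h1=130, n_h2=50, n_actions=15, seed=0):
        rng = random.Random(seed)
        self.h1 = init_layer(n_belief, n_h1, rng)
        self.h2 = init_layer(n_h1, n_h2, rng)
        self.pi_head = init_layer(n_h2, n_actions, rng)
        self.q_head = init_layer(n_h2, n_actions, rng)

    def forward(self, belief):
        h1 = relu(dense(belief, *self.h1))
        h2 = relu(dense(h1, *self.h2))           # 50-unit bottleneck
        pi = softmax(dense(h2, *self.pi_head))   # probabilities over actions
        q = dense(h2, *self.q_head)              # unbounded Q estimates
        return pi, q


net = ActorCriticNet()
pi, q = net.forward([0.0] * 268)
```

Because both heads read the same bottleneck layer, any gradient flowing from either head updates the shared representation.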
The activation function for layers $h_1$ and $h_2$ is ReLU, which was chosen empirically as it led to faster training. The activation function for $\pi$ is softmax, which converts the inputs to a probability distribution with values between 0 and 1, summing up to 1. There is no activation function for the output $Q$, as we want it to have an unlimited range, both from above and below (as rewards can be negative). All the layers in the NN are fully connected, which imposes the least structural constraints on the architecture. We perform our experiments on the Cambridge Restaurants domain, the details of which are given in Appendix A. In this domain the belief state is represented by a 268-dimensional vector. This is the input of the NN. Our layer $h_1$ consists of 130 neurons and $h_2$ has 50 neurons. These numbers were chosen empirically, with the goal of forcing the NN to encode all information about the belief state relevant to $\pi$ and $Q$ in the bottleneck layer of 50 neurons, thereby learning a mapping that generalises better. The output vectors $\pi$ and $Q$ have the dimensionality of the action space. Initially, we experiment with the summary action space, which has 15 actions (see Appendix A for details).
V-B Learning in master action space
V-B1 Master actions for ACER
In addition to applying ACER on the summary space, we also applied it on the master action space. However, to make this efficient, the NN architecture was redesigned.
In the case of the CamInfo domain (see Appendix A), there are 8 informable slots of an entity, each with a binary choice on whether we inform on it. Thus, a single inform action makes up $2^8 = 256$ separate master actions, only differing in what they inform on. We want to incorporate the fact that these actions are very similar into the design of the NN architecture. We achieve this by breaking the policy into a summary policy $\pi_s$, corresponding to the 15-dimensional summary action space, and a payload policy $\pi_p$, corresponding to the choices of the payload of an inform action. We break the $Q$ function up similarly into $Q_s$ and $Q_p$. We reconstruct the 1035-dimensional master policy $\pi_m$ (see Appendix A) and master function $Q_m$ as follows: for each summary action $a_s$,

If $a_s$ does not have a payload (i.e. $a_s$ is not an inform action), append the corresponding summary values from $\pi_s$ and $Q_s$ onto $\pi_m$ and $Q_m$.

Otherwise, for each payload $p$ of the possible choices, append $\pi_s(a_s) \, \pi_p(p)$ to $\pi_m$. This is because the probability of choosing action $a_s$ with payload $p$ is modelled as the product of the probability of choosing $a_s$ and that of choosing $p$. For each $p$, we also append $Q_s(a_s) + Q_p(p)$ to $Q_m$, allowing the payload network to learn an offset of $Q$ achieved by choosing a particular payload.
The complete NN architecture is illustrated in Figure 2. It is important to note that only the architecture of the NNs is changed and the training algorithm is unchanged. In fact, the NNs are treated as a black box by ACER. Each output is a 1035-dimensional vector for the master action space.
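The reconstruction of the master outputs can be sketched as below. The sketch assumes one payload head shared by all inform actions, with 256 payload choices. The number of inform actions (four) is not stated in this section; it is inferred from the arithmetic 11 + 4 × 256 = 1035, and the indices used for them are hypothetical.

```python
def build_master_outputs(pi_s, q_s, pi_p, q_p, inform_actions):
    """Combine summary and payload heads into master-space outputs.

    pi_s, q_s      : policy and Q values over summary actions
    pi_p, q_p      : policy and Q offsets over payload choices
    inform_actions : indices of summary actions that carry a payload
    """
    pi_m, q_m = [], []
    for a in range(len(pi_s)):
        if a in inform_actions:
            for p in range(len(pi_p)):
                pi_m.append(pi_s[a] * pi_p[p])  # product of probabilities
                q_m.append(q_s[a] + q_p[p])     # payload acts as an offset
        else:
            pi_m.append(pi_s[a])
            q_m.append(q_s[a])
    return pi_m, q_m


n_summary, n_payloads = 15, 2 ** 8
pi_m, q_m = build_master_outputs(
    [1.0 / n_summary] * n_summary,
    [0.0] * n_summary,
    [1.0 / n_payloads] * n_payloads,
    [0.5] * n_payloads,
    inform_actions={0, 1, 2, 3},  # hypothetical indices of the inform actions
)
```

Since the payload probabilities sum to one, the master policy built this way is still a valid probability distribution.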
V-B2 Master actions for GPSARSA
We compare ACER to the GPSARSA algorithm. This is an on-policy algorithm that approximates the $Q$ function as a Gaussian process (GP) and therefore is very sample-efficient [21]. The key is the use of a kernel function which defines correlations between different parts of the input space. Similarly to ACER, the GP method needs to be adjusted before we deploy it on the master action space. The core of a GP is the kernel function, which in the case of GPSARSA is defined as:
$$k\left((\mathbf{b}, a), (\mathbf{b}', a')\right) = \langle \mathbf{b}, \mathbf{b}' \rangle \, \delta_{a, a'}.$$
We recall that the kernel function defines our a priori belief of the covariance between any two belief-action pairs. This kernel is a multiplication of a scalar product of the beliefs and a Kronecker delta on the actions. The latter has the effect that any two different actions are considered completely independent. While this might be a good approximation for summary actions, a more elaborate action kernel is required for master actions. This could introduce the idea that two inform actions with slightly different payloads are expected to have similar results on the same belief state, thus showing higher covariance.
Our new action kernel returns 0 for actions $a$ and $a'$ that stem from different summary actions. Otherwise, $a$ and $a'$ are the same inform action with differing payloads. In this case, we calculate the kernel based on the cosine similarity of the two payloads, treating the payloads as vectors describing the sets of slots to inform on. Let us call $a_s$ and $a_p$ the summary action and the payload corresponding to $a$. $a_p$ is represented as a binary vector whose entries indicate whether the corresponding slot is informed on. Writing $\hat{a}_p$ for the normalised version of the payload vector, the kernel becomes $k_A(a, a') = \delta_{a_s, a'_s} \, \langle \hat{a}_p, \hat{a}'_p \rangle$. Refer to [36] for a proof of $k_A$ being a valid kernel function.
In the case of GPSARSA on the master action space, the training algorithm is unchanged. Only the kernel function is adjusted to incorporate the idea of similarity between master actions. The GP can thus be trained on the 1035-dimensional master action space.
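The adjusted kernel can be sketched as follows. Several details are assumptions made for illustration: payloads are encoded as 0/1 tuples, actions are (summary action, payload) pairs with `None` as the payload of non-inform actions, two identical non-inform actions get an action kernel of 1, and the action names are hypothetical.

```python
import math


def payload_similarity(p, q):
    """Cosine similarity of two payload vectors (0/1 entries assumed)."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    if norm == 0.0:
        return 1.0 if tuple(p) == tuple(q) else 0.0  # empty payloads
    return dot / norm


def action_kernel(a, b):
    """Zero across different summary actions, cosine similarity within one."""
    (summary_a, payload_a), (summary_b, payload_b) = a, b
    if summary_a != summary_b:
        return 0.0
    if payload_a is None or payload_b is None:
        return 1.0  # same action without payload (assumption)
    return payload_similarity(payload_a, payload_b)


def gp_kernel(belief_a, action_a, belief_b, action_b):
    """Product of the belief scalar product and the action kernel."""
    belief_part = sum(x * y for x, y in zip(belief_a, belief_b))
    return belief_part * action_kernel(action_a, action_b)


similar = action_kernel(("inform", (1, 1, 0)), ("inform", (1, 0, 0)))
unrelated = action_kernel(("inform", (1, 0, 0)), ("request", None))
same_pair = gp_kernel([0.5, 0.5], ("bye", None), [0.5, 0.5], ("bye", None))
```

Two inform actions that share most of their slots therefore receive a covariance close to that of identical actions, while actions from different summary actions remain independent.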
Vi Limitations
It is important to highlight some limitations of this work. We do not address the problem of modelling policies over large action spaces in which the system actions share no similarities. On the contrary, we focus on large action spaces where we can establish some relations between the actions, either by sharing weights in the neural network architecture as in Section V-B1 or by defining special kernel functions as in Section V-B2. Although this might seem limiting, in practice the actions in any task-oriented dialogue bear a lot of similarities. This used to be addressed by producing a smaller summary space of distinct actions, but we believe that the proposed approach scales better, removes handcrafting and leads to better performance. The latter hypothesis is investigated in the next section.
VII Evaluation
In this section, we evaluate the performance of ACER incorporated in an SDS. We find that ACER delivers the best performance and fastest convergence among the compared NN-based algorithms (eNAC and A2C) implemented in the PyDial dialogue toolkit [37]. We also deploy the algorithm in a more challenging setting without the execution mask aiding action selection. Next, we investigate the effect of different hyperparameter selections and the algorithm’s stability with respect to them. Then, we deploy ACER and GP on the master action space. Finally, we investigate how resilient different algorithms are to semantic errors and changing testing conditions.
VII-A Evaluation setup
We compare our implementation of ACER to two NN-based algorithms, namely eNAC [38] and A2C, and to a non-parametric algorithm, GP.
Experiments are run as follows. First, the total number of training dialogues (iterations) is broken down into equally sized milestones. As training progresses, a snapshot of the state of the training (all NN weights, hyperparameters, and replay memory) is saved at each milestone. A separate run of 4000 iterations is then performed without any training steps, in which each of the saved snapshots is tested for 200 iterations. No training and no exploration is performed during the testing phase; instead of $\epsilon$-greedy exploration, the greedy policy is used to derive the next action. This tells us how the system would perform had it stopped training at a specific milestone, allowing us to observe the speed of convergence and the performance at early milestones, discounting the effect of exploration.
We repeat the evaluation several times and average the results to reduce the variance arising from different random initialisations. We compare the average per-episode reward obtained by the agent, the average number of turns in a dialogue and the percentage of successful dialogues. The reward is a fixed bonus for a successful dialogue minus the number of turns in the dialogue. The number of turns is capped; if the user has not achieved their goal by then, the dialogue is deemed unsuccessful. The same discount factor is used for all algorithms where it is applicable. For NN-based algorithms, the training step is performed on minibatches of a fixed size, and $\epsilon$-greedy exploration is used, with $\epsilon$ reduced linearly over the training process. For algorithms employing experience replay (ER), the replay memory has a fixed capacity.
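The reward definition and the linear exploration schedule can be sketched as follows. The function names are hypothetical, and the numeric values used in the usage note (a success bonus of 20 and an initial exploration rate of 0.3) are placeholders chosen for illustration, not the settings of these experiments.

```python
def dialogue_reward(success: bool, num_turns: int, success_reward: float) -> float:
    """Success bonus (if any) minus one point per dialogue turn."""
    return (success_reward if success else 0.0) - num_turns


def linear_epsilon(step: int, total_steps: int, eps_start: float, eps_end: float) -> float:
    """Exploration rate interpolated linearly from eps_start to eps_end."""
    # Clamp the training progress to [0, 1] so the schedule stays at eps_end.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

For instance, with the placeholder values above, a successful dialogue of 6 turns would receive `dialogue_reward(True, 6, 20.0)`, and halfway through training the exploration rate would be `linear_epsilon(50, 100, 0.3, 0.0)`.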
VII-B User simulator
We use the agenda-based user simulator with the focus belief tracker for all experiments; for details, see [37]. The agenda-based user simulator [39] consists of a goal, a set of randomly generated slot-value pairs that the entity sought by the user must satisfy, and an agenda, a dynamic stack of dialogue acts that the user elicits in order to satisfy the goal. The simulated user combines deterministic and stochastic decisions, which makes it capable of generating complex behaviour. A typical dialogue starts with the user either expressing what it is looking for or waiting for the system to prompt it. The user then checks whether the offered entity satisfies all the constraints. In that process it sometimes changes its goal and asks for something else, making it more difficult for the system to satisfy its goal. Once it settles on the offered entity, it asks for additional information, such as the address or phone number. For a dialogue to be deemed successful, the offered entity needs to match the last user goal, and the system must provide all further information that the user simulator asked for. The reward is delayed and only given at the end of the dialogue. No reward is given for partially completed tasks.
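The success criterion described above can be sketched as a simple check. The data layout (dictionaries of slot-value pairs and sets of slot names) and the function name are assumptions for illustration; the actual simulator in PyDial may represent goals differently.

```python
def dialogue_successful(offered_entity: dict, final_goal: dict,
                        requested_slots: set, informed_slots: set) -> bool:
    """True if the offer matches the last goal and all requests were answered."""
    # The offered entity must satisfy every constraint of the last user goal.
    matches_goal = all(offered_entity.get(slot) == value
                       for slot, value in final_goal.items())
    # Every slot the user asked about must have been informed by the system.
    return matches_goal and requested_slots <= informed_slots
```

A dialogue that offers a matching restaurant but never provides a requested phone number would be judged unsuccessful under this check, in line with the rule that no reward is given for partially completed tasks.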
VII-C Performance of ACER
In the initial environment, the simulated semantic error rate is 0% both for training and testing. Instead of simple gradient descent on the loss function, we use the Adam optimiser, which adds momentum to the gradient [40]. To discourage the algorithm from learning a trivial policy, we subtract 1% of the policy entropy from the loss function. The results are given in Figure 3 and Figure 4, where the shaded area represents a confidence interval.
We observe that ACER is comparable to GP in terms of speed of convergence, sample efficiency, success rate, rewards and turns. While the success rate of ACER remains one or two percentage points below that of GP, ACER requires fewer dialogue turns and ultimately obtains somewhat higher rewards than GP. This suggests that the slightly worse success rate of ACER reflects a shortcoming of the reward function rather than of the algorithm, as the algorithm only optimises the reward function. We also observe that ACER far exceeds the performance of the other NN-based methods in terms of speed of convergence, sample efficiency, success rate and rewards.
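The entropy term used in the training setup of this section can be sketched as follows. The function names are hypothetical and the policy is represented as a plain list of action probabilities; only the 1% weight is taken from the text.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularised_loss(base_loss, probs, entropy_weight=0.01):
    """Subtract a fraction of the policy entropy from the loss."""
    return base_loss - entropy_weight * policy_entropy(probs)
```

Because the entropy is subtracted from the loss, a near-deterministic policy receives no bonus, while a policy that keeps some probability mass on alternative actions is penalised slightly less, which discourages premature collapse onto a single action.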
VII-D Effect of execution mask
We run our experiments with and without the execution mask and compare success rates (Figure 3 and Figure 6).
In general, as expected, algorithms converge more slowly without the execution mask, and the final performance of GP and ACER remains somewhat below their performance with the mask. This is also expected, as a mapping learned by RL is rarely as precise as a hard-coded solution to a problem (the execution mask). GP shows faster initial convergence than ACER, while the latter shows steadier progress without unexpected dips in performance. They remain comparable in every other regard.
VII-E Hyperparameter tuning
ACER has several additional hyperparameters compared to more traditional algorithms. We investigate the effect of these hyperparameters on the algorithm’s performance. To better illustrate the differences, we run the tests in a more challenging setting, without the execution mask. For every analysed parameter, we keep the remaining hyperparameters at the values that gave the best results in the preceding experiments.
Importance Weight threshold
This value is the upper bound on the importance sampling (IS) weight; weights above it are truncated. Setting this value too high diminishes the effect of weight truncation, while a value that is too low relies more on the less accurate bias correction term. Figure 7 shows which setting delivers the highest convergence rate together with a good final performance. We also see that, across a wide range of values, there is no big difference in final performance, suggesting that the algorithm is relatively stable with respect to this hyperparameter.
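The trade-off between truncation and bias correction can be made concrete with a small sketch that follows the truncated importance sampling scheme of [12]. The function names and the symbol `c` for the threshold are our own, and whether the implementation evaluated here computes these quantities in exactly this form is an assumption.

```python
def truncated_importance_weight(pi_prob: float, mu_prob: float, c: float) -> float:
    """Importance weight pi/mu, clipped from above at the threshold c."""
    rho = pi_prob / mu_prob
    return min(c, rho)


def bias_correction_coefficient(pi_prob: float, mu_prob: float, c: float) -> float:
    """Weight of the bias correction term, non-zero only where rho exceeds c."""
    rho = pi_prob / mu_prob
    if rho <= 0.0:
        return 0.0
    return max(0.0, 1.0 - c / rho)
```

With a large `c`, almost no weight is clipped and the correction coefficient is nearly always zero; with a small `c`, most of the update is carried by the correction term.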
KL divergence constraint
This value constrains the KL divergence between an updated policy and the running average policy. Setting it too high allows radical jumps, while setting it too low slows convergence down (Figure 8). Some of the tested settings result in erratic changes in the performance of ACER, while others are sensible choices.
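In the ACER formulation of [12], this constraint is enforced in closed form by projecting the policy gradient. The sketch below assumes that formulation: `g` is the gradient of the objective, `k` is the gradient of the KL divergence to the average policy, and `delta` is the constraint value. The names and the plain-list representation are illustrative only.

```python
def trust_region_gradient(g, k, delta):
    """Return g - max(0, (k.g - delta) / |k|^2) * k."""
    k_dot_g = sum(ki * gi for ki, gi in zip(k, g))
    k_norm_sq = sum(ki * ki for ki in k)
    if k_norm_sq == 0.0:
        # The policy equals the average policy, so nothing needs projecting.
        return list(g)
    scale = max(0.0, (k_dot_g - delta) / k_norm_sq)
    return [gi - scale * ki for gi, ki in zip(g, k)]
```

A large `delta` leaves the gradient untouched in most updates, which permits the radical jumps mentioned above, while a small `delta` removes most of the component along `k` and slows learning.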
Average policy update weight
In Figure 9, we can see that when the update weight is too small, the average policy forgets the history too quickly, allowing the policy to gain momentum in any direction and thus preventing it from converging to a good performance. With an intermediate weight, the policy converges quickly, while a larger weight results in a somewhat conservative algorithm, where the KL divergence constraint keeps the policy near a slowly changing average. The latter still converges to a good result, but does so somewhat more slowly.
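The role of the update weight can be seen in a minimal sketch of the running average, assuming the soft update used in [12], where the weight `alpha` multiplies the old average. The function name and the flat parameter lists are illustrative.

```python
def update_average_policy(avg_params, params, alpha):
    """Soft update: keep a fraction alpha of the old average parameters."""
    return [alpha * a + (1.0 - alpha) * p for a, p in zip(avg_params, params)]
```

With `alpha` close to 0 the average simply tracks the current policy, so the KL constraint has little effect; with `alpha` close to 1 the average barely moves and the constraint holds the policy back.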
Training iterations
Setting the number of training steps per episode higher allows the algorithm to learn more from the gathered experience. However, if this number is too high, the training might diverge due to the policy moving too much (Figure 10). Among the tested settings, one gives quick convergence and good performance, one leaves performance poor throughout, and for two the algorithm diverges completely.
VII-F Master action space
ACER compares favourably to other NN-based algorithms, but performs about equally to, if not slightly worse than, GP in our experiments. These experiments were run on the summary action space, which has only a small number of actions. In a more difficult scenario, we may have orders of magnitude more actions. In such scenarios, the computational cost of GPs can be prohibitive, as they need to invert the Gram matrix [21]. If ACER still performs well under the same scenario, it may be the overall best method to apply to larger action spaces. This is because ACER does not have the prohibitive computational cost of GP, and is expected to train much more quickly.
To test our hypotheses, we deploy ACER on the master action space according to Section V-B1, and on the summary space (Figure 11). Both experiments were run with the execution mask. Convergence is slower on the master action space. This is expected, as the policy has to choose between a vastly higher number of actions on the master action space (1035) than on the summary space. However, ACER is still surprisingly effective on the master action space, converging to about the same performance as on the summary space. We note that this is without any modification to the training algorithm; only the underlying NN is changed. ACER achieves the best results in terms of speed of convergence and final performance on the master action space among NN-based SDS policy optimisation algorithms.
To investigate further whether ACER is the best choice of algorithm on the master action space, we modify GP to run on the master action space according to Section V-B2. We compare ACER and GP on both summary and master action spaces, without the execution mask, in Figure 12. Both GP and ACER show slower convergence on the master action space. This is expected, as the random initialisation of a policy on the master action space will be much less sensible than an initialisation on the summary space, the latter taking advantage of the hard-coded summary-to-master action mapping. However, it is surprising to see that all experiments converged to roughly the same performance of about 97% success rate, except for GP on the summary space, which has a final success rate of 98-99%. This suggests that both ACER and GP can handle large action spaces quite efficiently. To our knowledge, this is the first time learning on the master action space from scratch was successfully attempted.
GP is more sample efficient than ACER on the challenging master action space without the execution mask. However, it requires vastly more computational resources to run: this experiment took hours to run with ACER, and days with GP (running times were measured on an Azure cloud machine with a 16-core CPU and 64 GB of RAM). Arguably, the extra computational cost overshadows the disadvantage of ACER, namely that it has to be run for more iterations to converge.
VII-G Noise robustness
So far, our experimental settings were quite idealised, training and testing policies under a perfect simulator with no semantic errors. However, in real life both the automatic speech recognition (ASR) component and the spoken language understanding (SLU) component are very likely to make errors. Therefore, in reality, the pipeline surrounding the policy optimiser deals with substantial uncertainty, which tends to introduce errors [41]. We ultimately want to measure how well a policy optimiser can learn the optimal strategy in the face of noisy semantic-level input. In our experiments, we control this through the semantic error rate, the rate at which a random noisy input is introduced to the optimiser to simulate an error scenario. In other words, a 15% semantic error rate means that with probability 0.15 a semantic concept (slot, value or dialogue act type) presented to the dialogue manager is incorrect. We focus on two desirable properties of a policy. First, ideally, the policy would learn not to trust the input as much, and ask questions until it is sure about the user goal, just like a real human would if the telephone line is noisy. Second, an ideal policy would not only adjust to the error rate of the training conditions, but would dynamically adjust to the conditions of the dialogue it is in. If the policy adjusts too much to the training conditions, it is said to overfit. This could severely limit the policy’s deployability.
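The notion of a semantic error rate can be illustrated with a toy corruption routine. This is only a sketch: the string encoding of semantic concepts, the `confusions` table of incorrect alternatives and the function name are hypothetical, and the error model of the actual simulator is more elaborate.

```python
import random


def corrupt_concepts(concepts, confusions, error_rate, rng):
    """Replace each concept by an incorrect one with probability error_rate."""
    noisy = []
    for concept in concepts:
        alternatives = confusions.get(concept, [])
        if alternatives and rng.random() < error_rate:
            # Hypothetical error model: pick a wrong alternative uniformly.
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(concept)
    return noisy
```

Calling `corrupt_concepts(["food=indian"], {"food=indian": ["food=korean"]}, 0.15, random.Random(0))` would then present the wrong food type to the dialogue manager in roughly 15% of turns.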
We test key algorithms for these two desirable properties. eNAC, the best known NN-based policy optimiser to date [5], is compared to ACER and GP. ACER and GP are also compared to their respective variants on the master action space. We run the test as follows: first, we train the algorithms under a 15% semantic error rate until convergence, with the execution mask. Then we take the fully trained policy and test it under a range of semantic error rates to measure the policies’ generalisation properties. Such a mismatch between training and testing conditions never arises in games, so this aspect of learning is rarely examined, but it is of utmost importance for spoken dialogue systems. We present results in Figure 13 and Figure 14 with a 95% confidence interval.
Success rate and reward follow the same trends. As expected, we see a general downwards trend for each algorithm as the semantic error rate increases. There is however no apparent spike in performance at the 15% semantic error rate of the training process, indicating that none of the algorithms overfit to this setting. We can see that the performance of eNAC is far behind all the other algorithms. ACER and GP are closer in performance, but GP on summary space consistently beats ACER on summary space.
It might be surprising that both ACER and GP perform better when trained on the master action space as opposed to the summary space, given that they performed worse in previous experiments. However, those experiments had no semantic errors, and a handcrafted rigid mapping from summary actions to master actions, that relied on the belief state to find the best payload for an inform action. Under a higher semantic error rate, the belief state will be noisy and this mapping may not perform optimally. This highlights the benefits of expanding the scope of artificial intelligence in SDS: AI can be more versatile than handcoded mappings, especially when the mapping performs decision making under uncertainty.
VII-H Human evaluation
In the previous sections, training and testing were performed on the same simulated user. To test the generalisation capabilities of the proposed methods, we evaluate the trained dialogue policies in interaction with human users, in a setup similar to that of [42]. To recruit the users, we use the Amazon Mechanical Turk (AMT) service, where volunteers can call our dialogue system and rate it. Three policies (GP and ACER on the summary action space and ACER on the master action space) were trained with simulated semantic errors to accommodate ASR errors, using the setup from the previous sections. The learnt policies were then incorporated into an SDS pipeline with a commercial ASR system.
The MTurk users were asked to find restaurants that have particular features as defined by the given task. Subjects were randomly allocated to one of the three analysed systems. After each dialogue, the users were asked whether they judged the dialogue to be successful or not, which was then translated into a reward measure. Table I presents the averaged results with one standard deviation. The success rates of all three models are high and practically indistinguishable. However, ACER trained on the master action space achieves considerably higher reward (and, in turn, a smaller number of turns) than the models working on the summary action space.
Table I: Human evaluation results (mean, with one standard deviation in parentheses).
               GP summary       ACER summary     ACER master
Success rate   89.7%            88.7%            89.1%
Reward         11.29 (± 7.54)   11.39 (± 7.17)   11.83 (± 8.05)
No. of turns   6.61 (± 3.12)    6.42 (± 2.84)    5.98 (± 3.22)
VIII Conclusion
The policy optimisation algorithms presented in this paper improve the state of the art in spoken dialogue systems (SDS) in three ways:

Our implementation of GP with a redesigned kernel function achieves the best performance on master action space, which previously was not possible.
GP suffers from an inherently high computational cost, making the algorithm unsuitable in higher volume action spaces. In such cases, the fact that ACER can be trained well on the master action space indicates that it may be the best currently known method to train policies with large action spaces.
As agents powered by machine learning gain more intelligence, they can be applied to more challenging domains. Using the master action space is a good example of this: a hard-coded mapping between summary and master action spaces can be used to simplify the task of the AI agent. However, as we have shown, it is no longer required to train in the summary action space. There is an algorithm (ACER) that can finally bridge the semantic gap between summary and master action spaces without the help of domain-specific code written explicitly for this mapping. (The design of neural networks in ACER was optimised for the dialogue management task, as described in Section V-B1. However, the training algorithm itself remains general.) This has three benefits: first, training on the master action space outperforms the mapping based on fixed code when uncertainty (semantic errors) is involved. Second, it allows us to build a more generally applicable system, with less work required to deploy it in differing domains. Third, it allows us to consider domains that have vastly larger action spaces, even if there is no clear way to convert those action spaces into small summary action spaces (such as a general-purpose dialogue system).
ACER fits well into other SDS research directions too. Successful policy optimisers need to be sample efficient and quick to train, to avoid subjecting human users to poor dialogue performance for long. ACER uses experience replay for sample efficiency, together with many methods aimed at reducing the bias and variance of the estimator, to achieve quick training.
We introduce some of the many directions in which this work could be continued. Recently, [5] combined supervised learning (SL) with deep reinforcement learning (DRL) to investigate the performance of an agent bootstrapped with SL and trained further with DRL. The NNs of ACER are compatible with that approach. This may decrease the overall number of interactions required for convergence, as well as increase sample efficiency.
Both of our settings, training on the summary and on the master action space, considered static action spaces only. Under this framework, the entire policy would have to be retrained if a new action or payload were to be introduced. This could hurt the maintainability of a real-life dialogue system, as it would be expensive to extend the database schema or the list of actions. Ideally, the training algorithm would adapt to such changes while retaining its pre-existing knowledge of the old actions. This is an important topic to investigate in the future.
Appendix A CamInfo action space
We define the action space in the CamInfo restaurants domain. Most informationseeking domains have a similar overall architecture.

request + slot where slot is an informable slot such as area, food, or pricerange. This action prompts the user to specify their criteria on a slot, eg. “Which area are you interested in?”

confirm + slot where slot is an informable slot. This action prompts the user to confirm their criteria on a slot that they may or may not have already mentioned. Due to errors accumulating during the decoding pipeline (speech recognition, semantic decoding, belief tracking), the system has to deal with considerable uncertainty, but it can attempt to increase its certainty in the user’s criteria by using a confirm action, eg. “Did you say you want an expensive restaurant?”

select + slot where slot is an informable slot. This action prompts a user to select a value for the slot from a specified list of values. This is less openended than a request action and more openended than a confirm action, eg. “Would you like Indian or Korean food?”.

inform + method + slots action provides information on a restaurant. The associated method specifies how the restaurant to give information on should be chosen. The standard method is to choose the first result in the ontology that matches the user criteria specified so far. The method can also be byname, in which case the system believes that the user asked about a specific restaurant by referring to its name, and information on that restaurant should be provided. If the method is requested, we inform on the same restaurant we informed on last, if it is alternatives then we pick another restaurant that matches the user’s criteria (if possible).
There are several properties of a restaurant, with a binary choice for each of them on whether the system wants to inform on it in a dialogue turn or not. The informable slots for restaurants are: area, food type, description, phone number, pricerange, address, postcode and signature.
We note that some of these slots are also requestable, allowing a user to query a restaurant based on those slots. These slots are area, food type and pricerange. A restaurant also has a name, which we will always inform on. Thus, with a binary choice for each of the eight informable slots, the system has a choice between $2^8 = 256$ different ways it can inform on a restaurant. A specific choice is referred to as the payload of an inform action.

reqmore is a simple action that prompts the user to provide more input.

bye is used to end the call, normally only as a response to the user’s intention to end the call.
For the CamInfo domain, the inform actions and the other actions make up 1035 actions in total. We call this action space the master action space. In the summary space, the inform actions do not specify which slots to inform on, leaving only one inform action per method.
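The size of the two action spaces can be checked by enumerating them. The sketch below rests on one assumption that is not stated explicitly above: that request, confirm and select each range over the three slots area, food and pricerange. Under that assumption the enumeration reproduces the 1035 master actions. The slot identifiers and tuple encoding are illustrative.

```python
from itertools import product

# Eight informable slots and four inform methods, as listed in this appendix.
INFORM_SLOTS = ["area", "food", "description", "phone",
                "pricerange", "address", "postcode", "signature"]
INFORM_METHODS = ["standard", "byname", "requested", "alternatives"]
# Assumption: request/confirm/select are defined over these three slots only.
CRITERIA_SLOTS = ["area", "food", "pricerange"]


def non_inform_actions():
    """request/confirm/select per criteria slot, plus reqmore and bye."""
    actions = [(act, slot) for act in ("request", "confirm", "select")
               for slot in CRITERIA_SLOTS]
    return actions + [("reqmore",), ("bye",)]


def master_actions():
    """Inform actions carry a method and a binary payload over all slots."""
    informs = [("inform", method, payload)
               for method in INFORM_METHODS
               for payload in product((0, 1), repeat=len(INFORM_SLOTS))]
    return non_inform_actions() + informs


def summary_actions():
    """In the summary space, inform actions only specify the method."""
    return non_inform_actions() + [("inform", m) for m in INFORM_METHODS]
```

Here `len(master_actions())` evaluates to 4 × 256 + 11 = 1035, while `len(summary_actions())` evaluates to 4 + 11 = 15 under the stated assumption.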
Appendix B Example dialogue
Below is an example dialogue between a user looking for a restaurant with a medium price range, and a system that internally translates between summary and master actions. System responses are written as :
Acknowledgment
The authors would like to thank all members of the Dialogue Systems Group for useful comments and suggestions.
References
 [1] N. Roy, J. Pineau, and S. Thrun, “Spoken dialogue management using probabilistic reasoning,” in Proceedings of SigDial, 2000.
 [2] S. J. Young, “Talking to machines (statistically speaking),” 7th International Conference on Spoken Language Processing, 2002.
 [3] S. Young, M. Gašić, B. Thomson, and J. D. Williams, “Pomdpbased statistical spoken dialog systems: A review,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1160–1179, 2013.
 [4] J. D. Williams, K. Asadi, and G. Zweig, “Hybrid code networks: practical and efficient endtoend dialog control with supervised and reinforcement learning,” Association for Computational Linguistics, pp. 665–677, 2017.
 [5] P.H. Su, P. Budzianowski, S. Ultes, M. Gasic, and S. Young, “Sampleefficient actorcritic reinforcement learning with supervised data for dialogue management,” Proceedings of SigDial, pp. 147–157, 2017.
 [6] M. Fatemi, L. E. Asri, H. Schulz, J. He, and K. Suleman, “Policy networks with twostage training for dialogue systems,” Proceedings of SigDial, pp. 101–110, 2016.
 [7] T.H. Wen, D. Vandyke, N. Mrkšić, M. Gašić, L. M. RojasBarahona, P.H. Su, S. Ultes, and S. Young, “A networkbased endtoend trainable taskoriented dialogue system,” in EACL, 2017.
 [8] T. Zhao and M. Eskenazi, “Towards endtoend learning for dialog state tracking and management using deep reinforcement learning,” in Proceedings of SigDial, 2016, pp. 1–10.
 [9] B. Dhingra, L. Li, X. Li, J. Gao, Y.N. Chen, F. Ahmed, and L. Deng, “Towards endtoend reinforcement learning of dialogue agents for information access,” in Association for Computational Linguistics, 2017, pp. 484–495.

 [10] X. Li, Y.N. Chen, L. Li, J. Gao, and A. Celikyilmaz, “End-to-end task-completion neural dialogue systems,” in International Joint Conference on Natural Language Processing, 2017, pp. 733–743.
 [11] B. Liu and I. Lane, “Iterative policy learning in end-to-end trainable task-oriented neural dialog models,” IEEE Automatic Speech Recognition and Understanding Workshop, 2017.
 [12] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample efficient actorcritic with experience replay,” International Conference on Learning Representations, 2017.
 [13] R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare, “Safe and efficient offpolicy reinforcement learning,” in Advances in Neural Information Processing Systems, 2016, pp. 1046–1054.
 [14] H. Bunt, “Context and dialogue control,” THINK Quarterly, vol. 3, 1994.
 [15] D. R. Traum, “20 questions on dialogue act taxonomies,” JOURNAL OF SEMANTICS, vol. 17, pp. 7–30, 2000.
 [16] M. Henderson, B. Thomson, and S. J. Young, “Deep neural network approach for the dialog state tracking challenge.” in Proceedings of SigDial, 2013, pp. 467–471.
 [17] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial intelligence, vol. 101, no. 1, pp. 99–134, 1998.
 [18] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
 [19] M. Gašić, S. Keizer, F. Mairesse, J. Schatzmann, B. Thomson, K. Yu, and S. Young, “Training and evaluation of the his pomdp dialogue system in noise,” Proceedings of SigDial, pp. 112–119, 2008.
 [20] S. Chandramohan, M. Geist, and O. Pietquin, “Optimizing spoken dialogue management with fitted value iteration,” InterSpeech, pp. 86–89, 2010.
 [21] M. Gasic and S. Young, “Gaussian processes for pomdpbased dialogue manager optimization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 28–40, 2014.
 [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [23] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
 [24] H. Cuayáhuitl, S. Keizer, and O. Lemon, “Strategic dialogue management via deep reinforcement learning,” NIPS Workshop on Deep Reinforcement Learning, 2015.
 [25] Z. C. Lipton, J. Gao, L. Li, X. Li, F. Ahmed, and L. Deng, “Efficient exploration for dialogue policy learning with bbq networks replay buffer spiking,” NIPS Workshop on Deep Reinforcement Learning, 2016.
 [26] J. D. Williams and G. Zweig, “Endtoend lstmbased dialog control optimized with supervised and reinforcement learning,” arXiv preprint arXiv:1606.01269, 2016.
 [27] K. Asadi and J. D. Williams, “Sampleefficient deep reinforcement learning for dialog control,” arXiv preprint arXiv:1612.06000, 2016.
 [28] L.J. Lin, “Selfimproving reactive agents based on reinforcement learning, planning and teaching,” Machine learning, vol. 8, no. 34, pp. 293–321, 1992.
 [29] T. Degris, M. White, and R. S. Sutton, “Offpolicy actorcritic,” International Conference on Machine Learning, pp. 179–186, 2012.
 [30] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour et al., “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, vol. 99, 1999, pp. 1057–1063.
 [31] D. Precup, R. S. Sutton, and S. Dasgupta, “Offpolicy temporaldifference learning with function approximation,” International Conference on Machine Learning, pp. 417–424, 2001.
 [32] B. Thomson, Statistical methods for spoken dialogue management. Springer Science & Business Media, 2013.
 [33] R. Pascanu and Y. Bengio, “Revisiting natural gradient for deep networks,” International Conference on Learning Representations, 2014.
 [34] J. Peters and S. Schaal, “Natural actorcritic,” Neurocomputing, vol. 71, no. 7, pp. 1180–1190, 2008.
 [35] J. Kober and J. R. Peters, “Policy search for motor primitives in robotics,” in Advances in Neural Information Processing Systems, 2009, pp. 849–856.
 [36] W. Gellért, “Sample efficient deep reinforcement learning for dialogue systems with large action spaces,” MPhil Thesis, University of Cambridge, 2017.
 [37] S. Ultes, L. M. RojasBarahona, P.H. Su, D. Vandyke, D. Kim, I. Casanueva, P. Budzianowski, N. Mrkšić, T.H. Wen, M. Gašić, and S. J. Young, “Pydial: A multidomain statistical dialogue system toolkit,” in ACL Demo. Association of Computational Linguistics, 2017.
 [38] J. Peters and S. Schaal, “Policy gradient methods for robotics,” in Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on. IEEE, 2006, pp. 2219–2225.
 [39] J. Schatzmann, B. Thomson, K. Weilhammer, H. Ye, and S. Young, “Agendabased user simulation for bootstrapping a pomdp dialogue system,” The Conference of the North American Chapter of the Association for Computational Linguistics, pp. 149–152, 2007.
 [40] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 2015.
 [41] X. Li, Y. Chen, L. Li, J. Gao, and A. Çelikyilmaz, “Investigation of language understanding impact for reinforcement learning based dialogue systems,” CoRR, vol. abs/1703.07055, 2017.
 [42] F. Jurčíček, S. Keizer, M. Gašić, F. Mairesse, B. Thomson, K. Yu, and S. Young, “Real user evaluation of spoken dialogue systems using Amazon Mechanical Turk,” in INTERSPEECH, 2011.