1 Introduction
Task-oriented Spoken Dialogue Systems (SDS) aim to assist users to achieve specific goals via speech, such as hotel booking, restaurant information and accessing bus schedules. These systems are typically designed according to a structured ontology (or a database schema), which defines the domain that the system can talk about. The development of a robust SDS traditionally requires a substantial amount of hand-crafted rules combined with various statistical components. These include a spoken language understanding module Chen et al. (2016); Yang et al. (2017), a dialogue belief state tracker Henderson et al. (2014); Perez and Liu (2016); Mrkšić et al. (2017) to predict user intent and track the dialogue history, a dialogue policy Young et al. (2013); Gašić and Young (2014); Budzianowski et al. (2017) to determine the dialogue flow, and a natural language generator Rieser and Lemon (2009); Wen et al. (2015); Hu et al. (2017) to convert conceptual representations into system responses.
In a task-oriented SDS, teaching a system how to respond appropriately in all situations is non-trivial. Traditionally, this dialogue management component has been designed manually using flow charts. More recently, it has been formulated as a planning problem and solved using reinforcement learning (RL) to optimise a dialogue policy through interaction with users Levin and Pieraccini (1997); Roy et al. (2000); Williams and Young (2007); Jurčíček et al. (2011). In this framework, the system learns by a trial-and-error process governed by a potentially delayed learning objective called the reward. This reward is designed to encapsulate the desired behavioural features of the dialogue. Typically it provides a positive reward for success plus a per-turn penalty to encourage short dialogues El Asri et al. (2014); Su et al. (2015a); Vandyke et al. (2015); Su et al. (2016b).
To allow the system to be trained on-line, Bayesian sample-efficient learning algorithms have been proposed Gašić and Young (2014); Daubigney et al. (2014) which can learn policies from a minimal number of dialogues. However, even with such methods, the initial performance is still relatively poor, and this can impact negatively on the user experience.
Supervised learning (SL) can also be used for dialogue action selection. In this case, the policy is trained to produce an appropriate response for any given dialogue state. Wizard-of-Oz (WoZ) methods Kelley (1984); Dahlbäck et al. (1993) have been widely used for collecting domain-specific training corpora. Recently an emerging line of research has focused on training neural network-based dialogue models, mostly in text-based systems Vinyals and Le (2015); Shang et al. (2015); Serban et al. (2015); Wen et al. (2017); Bordes et al. (2017). These systems are directly trained on past dialogues without detailed specification of the internal dialogue state. However, there are two key limitations of using SL in SDS. Firstly, the effect of selecting an action on the future course of the dialogue is not considered and this may result in sub-optimal behaviour. Secondly, there will often be a large number of dialogue states which are not covered by the training data Henderson et al. (2008); Li et al. (2014). Moreover, there is no reason to suppose that the recorded dialogue participants are acting optimally, especially in high noise levels. These problems are exacerbated in larger domains where multi-step planning is needed.
In this paper, we propose a network-based approach to policy learning which combines the best of both SL- and RL-based dialogue management, and which capitalises on recent advances in deep RL Mnih et al. (2015), especially off-policy algorithms Wang et al. (2017).
The main contributions of this paper are two-fold:

1. improving the sample-efficiency of actor-critic RL: trust region actor-critic with experience replay (TRACER) and episodic natural actor-critic with experience replay (eNACER).

2. efficient utilisation of demonstration data for improved early-stage policy learning.
The first part focuses primarily on increasing the RL learning speed. For TRACER, trust regions are introduced to standard actor-critic to control the step size and thereby avoid catastrophic model changes. For eNACER, the natural gradient identifies the steepest ascent direction in policy space to ensure fast convergence. Both models exploit off-policy learning with experience replay (ER) to improve sample-efficiency. These are compared with various state-of-the-art RL methods.
The second part aims to mitigate the cold start issue by using demonstration data to pre-train an RL model. This resembles the training procedure adopted in recent game-playing applications Silver et al. (2016); Hester et al. (2017). A key feature of this framework is that a single model is trained using both SL and RL with different training objectives but without modifying the architecture.
By combining the above, we demonstrate a practical approach to learning deep RL-based dialogue policies for new domains which can achieve competitive performance without significant detrimental impact on users.
2 Related Work
RL-based approaches to dialogue management have been actively studied for some time Levin et al. (1998); Lemon et al. (2006); Gašić and Young (2014). Initially, systems suffered from slow training, but recent advances in data-efficient methods such as Gaussian Processes (GP) have enabled systems to be trained from scratch in on-line interaction with real users Gašić et al. (2011). GP provides an estimate of the uncertainty in the underlying function and a built-in noise model. This helps to achieve highly sample-efficient exploration and robustness to recognition/understanding errors.
However, since the computation in GP scales with the number of points memorised, sparse approximation methods such as the kernel span algorithm Engel (2005) must be used and this limits the ability to scale to very large training sets. It is therefore questionable as to whether GP can scale to support commercial wide-domain SDS. Nevertheless, GP provides a good benchmark and hence it is included in the evaluation below.
In addition to increasing the sample-efficiency of the learning algorithms, the use of reward shaping has also been investigated in El Asri et al. (2014); Su et al. (2015b) to enrich the reward function in order to speed up dialogue policy learning.
Combining SL with RL for dialogue modelling is not new. Henderson et al. (2008) proposed a hybrid SL/RL model that, in order to ensure tractability in policy optimisation, performed exploration only on the states in a dialogue corpus. The policy was then defined manually on parts of the space which were not found in the corpus. A method of initialising RL models using logistic regression was also described Rieser and Lemon (2006). For GPRL in dialogue, rather than using a linear kernel that imposes heuristic data pair correlation, a pre-optimised Gaussian kernel learned using SL from a dialogue corpus has been proposed Chen et al. (2015). The resulting kernel was more accurate on data correlation and achieved better performance; however, the SL corpus did not help to initialise a better policy. Better initialisation of GPRL has been studied in the context of domain adaptation by specifying a GP prior or re-using an existing model which is then pre-trained for the new domain Gašić et al. (2013).
A number of authors have proposed training a standard neural-network policy in two stages Fatemi et al. (2016); Su et al. (2016a); Williams et al. (2017). Asadi and Williams (2016) also explored off-policy RL methods for dialogue policy learning. All these studies were conducted in simulation, using error-free text-based input. A similar approach was also used in a conversational model Li et al. (2016). In contrast, our work introduces two new sample-efficient actor-critic methods, combines both two-stage policy learning and off-policy RL, and tests at differing noise levels.
3 Neural Dialogue Management
The proposed framework addresses the dialogue management component in a modular SDS. The input to the model is the belief state that encodes a distribution over the possible user intents along with the dialogue history. The model’s role is to select the system action at every turn that will lead to the maximum possible cumulative reward and a successful dialogue outcome. The system action is mapped into a system reply at the semantic level, and this is subsequently passed to the natural language generator for output to the user.
The semantic reply consists of three parts: the intent of the response (e.g. inform), which slots to talk about (e.g. area), and a value for each slot (e.g. east). To ensure tractability, the policy selects from a restricted action set which identifies the intent and sometimes a slot; any remaining information required to complete the reply is extracted using heuristics from the tracked belief state.
3.1 Training with Reinforcement Learning
Dialogue policy optimisation can be seen as the task of learning to select the sequence of responses (actions) at each turn which maximises the long-term objective defined by the reward function. This can be solved by applying either value-based or policy-based methods. In both cases, the goal is to find an optimal policy $\pi^*$ that maximises the discounted total return $R = \sum_{t=0}^{T-1} \gamma^t r_t(b_t, a_t)$ over a dialogue with $T$ turns, where $r_t(b_t, a_t)$ is the reward when taking action $a_t$ in dialogue belief state $b_t$ at turn $t$ and $\gamma$ is the discount factor.
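As a small illustration of the discounted total return defined above, the following sketch accumulates per-turn rewards for one dialogue. The function name `discounted_return` and the example reward values are illustrative assumptions, not part of the described system.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for the per-turn rewards of one dialogue."""
    total = 0.0
    # Accumulate from the last turn backwards: R_t = r_t + gamma * R_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-turn dialogue with made-up rewards.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The backward accumulation avoids computing explicit powers of the discount factor.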
The main difference between the two categories is that policy-based methods have stronger convergence characteristics than value-based methods. The latter often diverge when using function approximation since they optimise in value space and a slight change in value estimate can lead to a large change in policy space Sutton et al. (2000).
Policy-based methods suffer from low sample-efficiency and high variance, and often converge to local optima, since they typically learn via Monte Carlo estimation Williams (1992); Schulman et al. (2016). However, they are preferred due to their superior convergence properties. Hence in this paper we focus on policy-based methods but also include a value-based method as a baseline.
3.1.1 Advantage Actor-Critic (A2C)
In a policy-based method, the training objective is to find a parametrised policy $\pi_\theta(a|b)$ that maximises the expected reward $J(\theta)$ over all possible dialogue trajectories given a starting state.
Following the Policy Gradient Theorem Sutton et al. (2000), the gradient of the parameters given the objective function has the form:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|b)\, Q^{\pi_\theta}(b,a)\right] \tag{1}$$
Since this form of gradient has a potentially high variance, a baseline function is typically introduced to reduce the variance whilst not changing the estimated gradient Williams (1992); Sutton and Barto (1999). A natural candidate for this baseline is the value function $V(b)$. Equation 1 then becomes:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|b)\, A_w(b,a)\right] \tag{2}$$
where $A_w(b,a) = Q(b,a) - V(b)$ is the advantage function. This can be viewed as a special case of the actor-critic, where $\pi_\theta$ is the actor and $A_w(b,a)$ is the critic, defined by two parameter sets $\theta$ and $w$. To reduce the number of required parameters, temporal difference (TD) errors $\delta_w = r_t + \gamma V_w(b_{t+1}) - V_w(b_t)$ can be used to approximate the advantage function Schulman et al. (2016). The left part in Figure 1 shows the architecture and parameters of the resulting A2C policy.
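The TD-error approximation of the advantage can be sketched as follows. This is a minimal illustration assuming the standard one-step form, reward plus discounted next-state value minus current value, with the value after the final turn taken as zero; the function name `td_errors` and the array layout are assumptions made for the example.

```python
import numpy as np


def td_errors(rewards, values, gamma=0.99):
    """One-step TD errors: r_t + gamma * V(b_{t+1}) - V(b_t).

    `values` holds one critic estimate per visited belief state; the value
    following the last turn is assumed to be 0 (the dialogue has ended).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values - values


# Hypothetical two-turn dialogue: the TD errors stand in for the advantages.
advantages = td_errors([0.0, 1.0], [0.5, 1.0], gamma=1.0)
```

Using the TD error means only a value estimate is needed from the critic, rather than separate estimates of the action-value and value functions.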
3.1.2 The TRACER Algorithm
To boost the performance of A2C policy learning, two methods are introduced:

Experience replay with off-policy learning for speed-up
On-policy RL methods update the model with the samples collected via the current policy. Sample-efficiency can be improved by utilising experience replay (ER) Lin (1992), where mini-batches of dialogue experiences are randomly sampled from a replay pool to train the model. This increases learning efficiency by re-using past samples in multiple updates whilst ensuring stability by reducing the data correlation. Since these past experiences were collected from different policies compared to the current policy, the use of ER leads to off-policy updates.
When training models with RL, $\epsilon$-greedy action selection is often used to trade off between exploration and exploitation, whereby a random action is chosen with probability $\epsilon$; otherwise the top-ranking action is selected. A policy used to generate the training dialogues (episodes) is referred to as a behaviour policy $\mu$, in contrast to the policy to be optimised, which is called the target policy $\pi$.
The basic A2C training algorithm described in §3.1.1 is on-policy since it is assumed that actions are drawn from the same policy as the target to be optimised ($\mu = \pi$). In off-policy learning, since the current policy $\pi$ is updated with the samples generated from old behaviour policies $\mu$, an importance sampling (IS) ratio is used to rescale each sampled reward to correct for the sampling bias at time-step $t$: $\rho_t = \pi(a_t|b_t) / \mu(a_t|b_t)$ Meuleau et al. (2000).
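For A2C, the off-policy gradient for the parametrised value function $V_w$ thus has the form:
$$\Delta w^{\text{off}} = \sum_{t=0}^{T-1} \left(\bar{R}_t - \hat{V}_w(b_t)\right) \nabla_w \hat{V}_w(b_t) \prod_{i=0}^{t} \rho_i \tag{3}$$
where $\bar{R}_t$ is the off-policy Monte-Carlo return Precup et al. (2001):
$$\bar{R}_t = r_t + \gamma r_{t+1} \prod_{i=1}^{1} \rho_{t+i} + \cdots + \gamma^{T-t-1} r_{T-1} \prod_{i=1}^{T-t-1} \rho_{t+i} \tag{4}$$
Likewise, the updated gradient for policy $\pi_\theta$ is:
$$\Delta \theta^{\text{off}} = \sum_{t=0}^{T-1} \rho_t \nabla_\theta \log \pi_\theta(a_t|b_t)\, \hat{\delta}_w \tag{5}$$
where $\hat{\delta}_w = r_t + \gamma \hat{V}_w(b_{t+1}) - \hat{V}_w(b_t)$ is the TD error using the estimated value of $\hat{V}_w$.
Also, as the gradient correlates strongly with the sampled reward, reward $r_t$ and total return $R$ are normalised to lie in [-1,1] to stabilise training.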
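The importance-sampling correction can be sketched in a few lines. The example assumes the off-policy Monte-Carlo return takes the recursive form in which each later reward is discounted and weighted by the product of the IS ratios of the intervening steps; the function names `is_ratios` and `off_policy_returns` are hypothetical, and the clipping range is passed in as a parameter rather than taken as a fixed property of the method.

```python
import numpy as np


def is_ratios(pi_probs, mu_probs, low=0.8, high=1.0):
    """Clipped importance-sampling ratios pi(a_t|b_t) / mu(a_t|b_t)."""
    ratios = np.asarray(pi_probs, dtype=float) / np.asarray(mu_probs, dtype=float)
    return np.clip(ratios, low, high)


def off_policy_returns(rewards, rhos, gamma=0.99):
    """Off-policy returns via R_t = r_t + gamma * rho_{t+1} * R_{t+1}."""
    num_turns = len(rewards)
    returns = np.zeros(num_turns)
    for t in reversed(range(num_turns)):
        if t == num_turns - 1:
            returns[t] = rewards[t]
        else:
            returns[t] = rewards[t] + gamma * rhos[t + 1] * returns[t + 1]
    return returns


# Hypothetical three-turn dialogue with made-up rewards and ratios.
example_returns = off_policy_returns([1.0, 1.0, 1.0], [1.0, 0.5, 0.5], gamma=1.0)
```

Unrolling the recursion reproduces the sum of discounted rewards, each multiplied by the ratios of the steps between the current turn and the turn where the reward was received.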

Trust region constraint for stabilisation
To ensure stability in RL, each per-step policy change is often limited by setting a small learning rate. However, setting the rate low enough to avoid occasional large destabilising updates is not conducive to fast learning.
Here, we adopt a modified Trust Region Policy Optimisation method introduced by Wang et al. (2017). In addition to maximising the cumulative reward $J(\theta)$, the optimisation is also subject to a Kullback-Leibler (KL) divergence limit between the updated policy $\theta$ and an average policy $\theta_a$ to ensure safety. This average policy represents a running average of past policies and constrains the updated policy to not deviate far from the average, $\theta_a \leftarrow \alpha \theta_a + (1-\alpha)\theta$, with a weight $\alpha$.
Thus, given the off-policy policy gradient $\Delta\theta^{\text{off}}$ in Equation 5, the modified policy gradient with trust region $z$ is calculated as follows:
$$\min_{z} \; \tfrac{1}{2} \left\lVert \Delta\theta^{\text{off}} - z \right\rVert_2^2 \quad \text{subject to} \quad \nabla_\theta D_{KL}\left[\pi_{\theta_a}(b_t) \,\Vert\, \pi_\theta(b_t)\right]^{\top} z \leq \xi$$
where $\pi$ is the policy parametrised by $\theta$ or $\theta_a$, and $\xi$ controls the magnitude of the KL constraint. Since the constraint is linear, a closed-form solution to this quadratic programming problem can be derived using the KKT conditions. Setting $k = \nabla_\theta D_{KL}\left[\pi_{\theta_a}(b_t) \,\Vert\, \pi_\theta(b_t)\right]$, we get:
$$z^*_{tr} = \Delta\theta^{\text{off}} - \max\left\{ \frac{k^{\top} \Delta\theta^{\text{off}} - \xi}{\lVert k \rVert_2^2},\, 0 \right\} k \tag{6}$$
When this constraint is satisfied, there is no change to the gradient with respect to $\theta$. Otherwise, the update is scaled down along the direction of $k$ and the policy change rate is lowered. This direction is also shown to be closely related to the natural gradient Amari (1998); Schulman et al. (2015), which is presented in the next section.
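The closed-form trust-region step can be illustrated numerically. The sketch below assumes the usual solution of a quadratic programme with a single linear constraint: if the inner product of the constraint direction with the proposed update exceeds the limit, the excess is removed along that direction; otherwise the update is left unchanged. The names `trust_region_update`, `delta_theta`, `k` and `xi` are illustrative.

```python
import numpy as np


def trust_region_update(delta_theta, k, xi=0.01):
    """Project a proposed update onto the half-space {z : k^T z <= xi}."""
    delta_theta = np.asarray(delta_theta, dtype=float)
    k = np.asarray(k, dtype=float)
    k_norm_sq = float(k @ k)
    if k_norm_sq == 0.0:
        # No constraint direction: the proposed update is returned as is.
        return delta_theta.copy()
    scale = max((float(k @ delta_theta) - xi) / k_norm_sq, 0.0)
    return delta_theta - scale * k


# Constraint violated: the update is shrunk along k until k^T z equals xi.
shrunk = trust_region_update([1.0, 0.0], [1.0, 0.0], xi=0.5)
# Constraint satisfied: the update is unchanged.
unchanged = trust_region_update([0.0, 1.0], [1.0, 0.0], xi=0.5)
```

Because only the component along the constraint direction is altered, the part of the update orthogonal to it is preserved.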
The above enhancements speed up and stabilise A2C. We call it the Trust Region Actor-Critic with Experience Replay (TRACER) algorithm.
3.1.3 The eNACER Algorithm
Vanilla gradient descent algorithms are not guaranteed to update the model parameters in the steepest direction due to re-parametrisation Amari (1998); Martens (2014). A widely used solution to this problem is to use a compatible function approximation for the advantage function in Equation 2: $\nabla_w A_w(b,a) = \nabla_\theta \log \pi_\theta(a|b)$, where the update of $w$ is then in the same update direction as $\theta$ Sutton et al. (2000). Equation 2 can then be rewritten as:
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|b)\, \nabla_\theta \log \pi_\theta(a|b)^{\top} w\right] = F(\theta) \cdot w$$
where $F(\theta)$ is the Fisher information matrix. This implies $\Delta\theta^{NG} = w = F(\theta)^{-1} \nabla_\theta J(\theta)$ and it is called the natural gradient. The Fisher matrix can be viewed as a correction term which makes the natural gradient independent of the parametrisation of the policy and corresponds to steepest ascent towards the objective Martens (2014). Empirically, the natural gradient has been found to significantly speed up convergence.
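Based on these ideas, the Natural Actor-Critic (NAC) algorithm was developed by Peters and Schaal (2006). In its episodic version (eNAC), the Fisher matrix does not need to be explicitly computed. Instead, the natural gradient $\Delta\theta^{NG}$ is estimated by a least squares method given the $n$-th episode consisting of a set of transition tuples $\{(b_t^n, a_t^n, r_t^n)\}_{t=0}^{T_n-1}$:
$$R^n = \left[\sum_{t=0}^{T_n-1} \nabla_\theta \log \pi_\theta(a_t^n|b_t^n)^{\top}\right] \cdot \Delta\theta^{NG} + C \tag{7}$$
which can be solved analytically. $C$ is a constant which is an estimate of the baseline $V(b)$.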
As in TRACER, eNAC can be enhanced by using ER and off-policy learning, thus called eNACER, whereby $R^n$ in Equation 7 is replaced by the off-policy Monte-Carlo return $\bar{R}_0^n$ at time-step $t=0$ as in Equation 4. For very large models, the inversion of the Fisher matrix can become prohibitively expensive to compute. Instead, a truncated variant can be used to calculate the natural gradient Schulman et al. (2015).
eNACER is structured as a feed-forward network with the output $\pi$ as in the right of Figure 1, updated with the natural gradient $\Delta\theta^{NG}$. Note that by using the compatible function approximation, the value function does not need to be explicitly calculated. This makes eNACER in practice a policy-gradient method.
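The least-squares estimate used in the episodic natural actor-critic can be sketched as an ordinary regression problem. The example assumes that each episode contributes one row made of the summed log-policy gradients over its turns plus a constant column for the baseline, with the episode return as the regression target. The function name `enac_natural_gradient` and the toy data are hypothetical, and this is an illustration of the regression step only, not the authors' implementation.

```python
import numpy as np


def enac_natural_gradient(grad_log_sums, returns):
    """Solve returns ~= grad_log_sums @ delta_theta + C in the least-squares sense.

    grad_log_sums: shape (num_episodes, num_params), each row the sum over turns
                   of the gradient of log pi(a_t|b_t) for one episode.
    returns:       shape (num_episodes,), one return per episode.
    """
    features = np.asarray(grad_log_sums, dtype=float)
    targets = np.asarray(returns, dtype=float)
    # Append a column of ones so the constant baseline C is estimated jointly.
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return solution[:-1], solution[-1]


# Toy data generated from a known direction and baseline (assumed values).
toy_features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
toy_returns = toy_features @ np.array([0.5, -1.0]) + 2.0
natural_gradient, baseline = enac_natural_gradient(toy_features, toy_returns)
```

Solving the regression directly is what removes the need to form or invert the Fisher matrix explicitly.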
3.2 Learning from Demonstration Data
From the user’s perspective, performing RL from scratch will invariably result in unacceptable performance in the early learning stages. This problem can be mitigated by using an off-line corpus of demonstration data to bootstrap a policy. This data may come from a WoZ collection or from interactions between users and an existing policy. It can be used in three ways: (A) pre-train the model, (B) initialise a supervised replay buffer, and (C) a combination of the two.
(A) For model pre-training, the objective is to ‘mimic’ the response behaviour from the corpus. This phase is essentially standard SL. The input to the model is the dialogue belief state $b$, and the training objective for each sample is to minimise a joint cross-entropy loss $\mathcal{L}(\theta) = -\sum_k y_k \log(p_k)$ between action labels $y$ and model predictions $p$, where the policy is parametrised by a set $\theta$.
A policy trained by SL on a fixed dataset may not generalise well. In spoken dialogues, the noise levels may vary across conditions and thus can significantly affect performance. Moreover, a policy trained using SL does not perform any long-term planning on the conversation. Nonetheless, supervised pre-training offers a good model starting point which can then be fine-tuned using RL.
(B) For supervised replay initialisation, the demonstration data is stored in a replay pool which is kept separate from the ER pool used for RL and is never overwritten. At each RL update iteration, a small portion of the demonstration data is sampled, and the supervised cross-entropy loss $\mathcal{L}_{SL}(\theta)$ computed on this data is added to the RL objective $\mathcal{L}_{RL}(\theta)$. Also, an L2 regularisation loss $\lVert\theta\rVert^2$ is applied to $\theta$ to help prevent it from over-fitting on the sampled demonstration dataset. The total loss to be minimised is thus:
$$\mathcal{L}_{all}(\theta) = \mathcal{L}_{RL}(\theta) + \lambda_1 \mathcal{L}_{SL}(\theta) + \lambda_2 \lVert\theta\rVert^2 \tag{8}$$
where the $\lambda$’s are weights. In this way, the RL policy is guided by the sampled demonstration data while learning to optimise the total return.
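The composite objective can be sketched as a weighted sum of three terms. The example assumes the supervised term is the mean cross-entropy over a sampled demonstration mini-batch and that the two weights apply to the supervised and L2 terms respectively; the function names and default weights are illustrative choices for the sketch.

```python
import numpy as np


def cross_entropy(probs, labels):
    """Mean cross-entropy between predicted action distributions and labels."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    picked = probs[np.arange(len(labels)), labels]
    # A small constant guards against log(0); this is an implementation choice.
    return float(-np.mean(np.log(picked + 1e-12)))


def total_loss(rl_loss, demo_probs, demo_labels, theta, lambda_sl=10.0, lambda_l2=0.01):
    """RL loss plus weighted supervised loss plus weighted L2 penalty."""
    supervised = cross_entropy(demo_probs, demo_labels)
    l2_penalty = float(np.sum(np.square(theta)))
    return rl_loss + lambda_sl * supervised + lambda_l2 * l2_penalty


# Hypothetical values: a perfectly predicted demonstration sample contributes
# almost nothing, so only the RL and L2 terms remain.
loss = total_loss(1.0, [[1.0, 0.0]], [0], theta=[1.0, 2.0])
```

Keeping the three terms separate makes it straightforward to switch the supervised term off, which recovers plain RL training with regularisation.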
(C) The learned parameters of the pre-trained model in method A above might be distributed differently from those of the optimal RL policy and this may cause some performance drop in early stages while learning an RL policy from this model. This can be alleviated by using the composite loss proposed in method B. A comparison between the three options is included in the experimental evaluation.
4 Experimental Results
Our experiments utilised the software toolkit PyDial Ultes et al. (2017), which provides a platform for modular SDS. The target application is a live telephone-based SDS providing restaurant information for the Cambridge (UK) area. The task is to learn a policy which manages the dialogue flow and delivers requested information to the user. The domain consists of approximately 100 venues, each with 6 slots out of which 3 can be used by the system to constrain the search (food-type, area and price-range) and 3 are system-informable properties (phone-number, address and postcode) available once a database entity has been found.
The input for all models was the full dialogue belief state $b$ of size 268, which includes the last system act and distributions over the user intention and the three requestable slots. The output includes 14 restricted dialogue actions determining the system intent at the semantic level. The selected action is then combined with the dialogue belief state and heuristic rules and mapped into a spoken response using a natural language generator.
4.1 Model Comparison
Two value-based methods are shown for comparison with the policy-based models described. For both of these, the policy is implicitly determined by the action-value (Q) function $Q(b_t, a_t)$ which estimates the expected total return when choosing action $a_t$ given belief state $b_t$ at time-step $t$. For an optimal policy $\pi^*$, the Q-function satisfies the Bellman equation Bellman (1954):
$$Q^*(b_t, a_t) = \mathbb{E}_{\pi^*}\left\{ r_t + \gamma \max_{a'} Q^*(b_{t+1}, a') \mid b_t, a_t \right\} \tag{9}$$
4.1.1 Deep Q-Network (DQN)
DQN is a variant of the Q-learning algorithm whereby a neural network is used to non-linearly approximate the Q-function. This suggests a sequential approximation in Equation 9 by minimising the loss:
$$L(w_t) = \mathbb{E}\left[\left(y_t - Q(b_t, a_t; w_t)\right)^2\right] \tag{10}$$
where $y_t = r_t + \gamma \max_{a'} Q(b_{t+1}, a'; w_t^-)$ is the target to update the parameters $w$. Note that $y_t$ is evaluated by a target network $w^-$ which is updated less frequently than the network $w$ to stabilise learning, and the expectation is over the tuples $(b_t, a_t, r_t, b_{t+1})$ sampled from the experience replay pool described in §3.1.2.
DQN often suffers from over-estimation on Q-values as the max operator is used to select an action as well as to evaluate it. Double DQN (DDQN) Van Hasselt et al. (2016) is thus used to de-couple the action selection and Q-value estimation to achieve better performance.
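The decoupling performed by Double DQN can be sketched for a mini-batch of transitions: the online network chooses the next action, and the target network supplies the value of that chosen action. The function name `ddqn_targets`, the argument layout and the zero bootstrap value on terminal transitions are assumptions for illustration.

```python
import numpy as np


def ddqn_targets(rewards, next_q_online, next_q_target, terminal, gamma=0.99):
    """Double DQN regression targets for a mini-batch.

    next_q_online / next_q_target: shape (batch, num_actions), Q-values of the
    next belief state from the online and target networks respectively.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_online = np.asarray(next_q_online, dtype=float)
    next_q_target = np.asarray(next_q_target, dtype=float)
    terminal = np.asarray(terminal, dtype=bool)
    # Action selection uses the online network ...
    best_actions = np.argmax(next_q_online, axis=1)
    # ... while evaluation of the selected action uses the target network.
    evaluated = next_q_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * np.where(terminal, 0.0, evaluated)


# Hypothetical batch of two transitions, the second one ending the dialogue.
targets = ddqn_targets(
    rewards=[0.0, 1.0],
    next_q_online=[[1.0, 2.0], [3.0, 0.0]],
    next_q_target=[[10.0, 20.0], [30.0, 40.0]],
    terminal=[False, True],
    gamma=0.5,
)
```

In plain DQN a single network would both pick and score the maximising action, which is the source of the over-estimation mentioned above.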
4.1.2 Gaussian Processes (GP) RL
GPRL is a state-of-the-art value-based RL algorithm for dialogue modelling. It is appealing since it can learn from a small number of observations by exploiting the correlations defined by a kernel function and provides an uncertainty measure of its estimates. In GPRL, the Q-function $Q(b,a)$ is modelled as a GP with zero mean and kernel $k$: $Q(b,a) \sim \mathcal{GP}\left(0, k((b,a),(b,a))\right)$. This Q-function is then updated by calculating the posterior given the collected belief-action pairs $(b,a)$ (dictionary points) and their corresponding rewards Gašić and Young (2014). The implicit knowledge of the distance between data points in observation space provided by the kernel greatly speeds up learning since it enables Q-values in as yet unexplored space to be estimated. Note that GPRL was used by Fatemi et al. (2016) to compare with deep RL, but no uncertainty estimate was used to guide exploration and as a result it had relatively poor performance. Here GPRL with uncertainty estimate is used as the benchmark.
4.2 Reinforcement Learning from Scratch
The proposed models were first evaluated under 0% semantic error rate with an agenda-based simulator which generates user interactions at the semantic level Schatzmann et al. (2006). In this case, the user intent is perfectly captured in the dialogue belief state without noise.
The total return of each dialogue was computed from the dialogue length $T$ and the success indicator $\mathbb{1}(\mathcal{D})$ for dialogue $\mathcal{D}$, rewarding success and penalising each turn. The maximum dialogue length was set to 20 turns and $\gamma$ was 0.99. All deep RL models (A2C, TRACER, eNACER and DQN) contained two hidden layers of size 130 and 50. The Adam optimiser was used Kingma and Ba (2014) with an initial learning rate of 0.001. During training, an $\epsilon$-greedy policy was used, where $\epsilon$ was initially set to 0.3 and annealed to 0.0 over 3500 training dialogues. For GP, a linear kernel was used.
The ER pool size was 1000, and the mini-batch size was 64. Once an initial 192 samples had been collected, the model was updated after every 2 dialogues. Note that for DQN, each sample was a state transition $(b_t, a_t, r_t, b_{t+1})$, whereas in A2C, TRACER and eNACER, each sample comprised the whole dialogue with all its state transitions. For eNACER, the natural gradient was computed to update the model weights of size 42000. For TRACER, the average-policy weight $\alpha$ was set to 0.02, and the KL constraint magnitude $\xi$ was 0.01. Since the IS ratio has a high variance and can occasionally be extremely large, it was clipped to the range [0.8,1.0] to maintain stable training.
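The exploration schedule in this setup can be sketched as follows. The text states only that the exploration rate starts at 0.3 and is annealed to 0.0 over 3500 training dialogues; the linear shape of the schedule is an assumption made for this example, as are the function names `annealed_epsilon` and `epsilon_greedy`.

```python
import random


def annealed_epsilon(dialogue_index, start=0.3, end=0.0, anneal_dialogues=3500):
    """Exploration rate after `dialogue_index` dialogues (linear schedule assumed)."""
    fraction = min(max(dialogue_index / anneal_dialogues, 0.0), 1.0)
    return start + fraction * (end - start)


def epsilon_greedy(action_scores, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the top-ranking one."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_scores))
    return max(range(len(action_scores)), key=lambda index: action_scores[index])


# With epsilon = 0 the choice is always the highest-scoring action.
chosen = epsilon_greedy([0.1, 0.7, 0.2], epsilon=annealed_epsilon(3500))
```

Any monotone schedule with the same end points would fit the description equally well; the linear form is simply the most compact to write down.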
Figure 2 shows the success rate learning curves of on-policy A2C, A2C with ER, TRACER, DQN with ER, GP and eNACER. All were tested with 600 dialogues after every 200 training dialogues. As reported in previous studies, the benchmark GP model learns quickly and is relatively stable. eNACER provides comparable performance. DQN also showed high sample-efficiency but with high instability at some points. This is because an iterative improvement in value space does not guarantee an improvement in policy space. Although both are comparatively slower to learn, the difference between on-policy A2C and A2C with ER clearly demonstrates the sample-efficiency of re-using past samples in mini-batches. The enhancements incorporated into the TRACER algorithm do make this form of learning competitive although it still lags behind eNACER and GPRL.
4.2.1 Learning from Demonstration Data
Regardless of the choice of model and learning algorithm, training a policy from scratch online will always result in a poor user experience until sufficient interactions have been experienced to allow acceptable behaviours to be learned.
As discussed in §3.2, an off-line corpus of demonstration data can potentially mitigate this problem. To test this, a corpus of 720 real user spoken dialogues in the Cambridge restaurant domain was utilised. The corpus was split in a 4:1:1 ratio for training, validation and testing. It contains interactions between real users recruited via the Amazon Mechanical Turk service and a well-behaved SDS as described in Su et al. (2016b).
For A2C with ER and TRACER, the three ways of exploiting demonstration data in §3.2 were explored. The exploration parameter $\epsilon$ was also set to 0.3 and annealed to 0.0 over 2000 training dialogues. Since TRACER has similar patterns to A2C with ER, we first explored the impact of demonstration data on the A2C with ER results since it provides more headroom for identifying performance gains.
Figure 2(a) shows the different combinations of demonstration data using A2C with ER in noise-free conditions. The supervised pre-trained model (SL model) provides reasonable starting performance. The A2C ER model with supervised pre-training (A2C ER+SL_model) improves on this after only 400 dialogues whilst suffering initially. We hypothesise that the optimised SL pre-trained parameters are distributed very differently from the optimal A2C ER parameters. Also, the A2C ER model with SL replay (A2C ER+SL_replay) shows clearly how the use of a supervised replay buffer can accelerate learning from scratch. Moreover, when SL pre-training is combined with SL replay (A2C ER+SL_model+replay), it achieved the best result. Note that $\lambda_1$ and $\lambda_2$ in Equation 8 were 10 and 0.01 respectively. In each policy update, 64 demonstration samples were randomly selected from the supervised replay pool, which is the same as the number of RL samples selected from ER for A2C learning.
Similar patterns emerge when utilising demonstration data to improve early learning in the TRACER and eNACER algorithms as shown in Figure 2(b). However, in this case, eNACER is less able to exploit demonstration data since the training method is different from standard actor-critics. Hence, the supervised loss cannot be directly incorporated into the RL objective as in Equation 8. One could optimise the model using the supervised loss separately after every RL update. However, in our experiments, this did not yield improvement. Hence, only eNACER learning from a pre-trained SL model is reported here. Compared to eNACER learning from scratch, eNACER from the SL model started with good performance but learned more slowly. Again, this may be because the optimised SL pre-trained parameters were distributed very differently from the optimal eNACER parameters and led to sub-optimality.
Overall, these results suggest that the proposed SL+RL framework to exploit demonstration data is effective in mitigating the cold start problem, and TRACER provides the best solution in terms of avoiding poor initial performance, rapid learning and competitive fully trained performance.
In addition to the noise-free performance, we also investigated the impact of noise on the TRACER algorithm. Figure 4 shows the results after training on dialogues via interaction with the user simulator under different semantic error rates. The random policy (white bars) uniformly sampled an action from the set of size 14. This can be regarded as the average initial performance of any learning system. We can see that SL generates a robust model which can be further fine-tuned using RL over a wide range of error rates. It should be noted, however, that the drop-off in performance at high noise levels is more rapid than might be expected, compared to GPRL. We believe that deep architectures are prone to over-fitting and in consequence do not handle well the uncertainty of the user behaviour. We plan to investigate this issue in future work. Overall, these outcomes validate the benefit of the proposed two-phased approach where the system can be effectively pre-trained using corpus data and further refined via user interactions.
5 Conclusion
This paper has presented two compatible approaches to tackling the problem of slow learning and poor initial performance in deep reinforcement learning algorithms. Firstly, trust region actor-critic with experience replay (TRACER) and episodic natural actor-critic with experience replay (eNACER) were presented. These have been shown to be more sample-efficient than other deep RL models and broadly competitive with GPRL. Secondly, it has been shown that demonstration data can be utilised to mitigate poor performance in the early stages of learning. To this end, two methods for using off-line corpus data were presented: simple pre-training using SL, and using the corpus data in a replay buffer. These were particularly effective when used with TRACER, which provided the best overall performance.
Experimental results were also presented for mismatched environments. Again, TRACER demonstrated the ability to avoid poor initial performance when trained only on the demonstration corpus, yet still improve substantially with subsequent reinforcement learning. It was noted, however, that performance still falls off rather rapidly in noise compared to GPRL, as the uncertainty estimates are not handled well by neural network architectures.
Finally, it should be emphasised that whilst this paper has focused on the early stages of learning a new domain where GPRL provides a benchmark and is hard to beat, the potential of deep RL lies in its ready scalability to exploit on-line learning with large user populations, as the model size does not depend on the size of the experience replay buffer.
Acknowledgments
Pei-Hao Su is supported by Cambridge Trust and the Ministry of Education, Taiwan. Paweł Budzianowski is supported by EPSRC Council and Toshiba Research Europe Ltd, Cambridge Research Laboratory. The authors would like to thank the other members of the Cambridge Dialogue Systems Group for their valuable comments.
References
 Amari (1998) Shun-Ichi Amari. 1998. Natural gradient works efficiently in learning. In Neural computation. MIT Press, volume 10, pages 251–276.
 Asadi and Williams (2016) Kavosh Asadi and Jason D Williams. 2016. Sample-efficient deep reinforcement learning for dialog control. In arXiv preprint arXiv:1612.06000.
 Bellman (1954) Richard Bellman. 1954. The theory of dynamic programming. Technical report, DTIC Document.
 Bordes et al. (2017) Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In Proc of ICLR.
 Budzianowski et al. (2017) Paweł Budzianowski, Stefan Ultes, Pei-Hao Su, Nikola Mrkšić, Tsung-Hsien Wen, Inigo Casanueva, Lina M. Rojas-Barahona, and Milica Gašić. 2017. Sub-domain modelling for dialogue management with hierarchical reinforcement learning. In Proc of SIGDIAL.
 Chen et al. (2015) Lu Chen, Pei-Hao Su, and Milica Gašić. 2015. Hyper-parameter optimisation of Gaussian process reinforcement learning for statistical dialogue management. In Proc of SIGDIAL.
 Chen et al. (2016) Yun-Nung Chen, Dilek Hakkani-Tür, Gokhan Tur, Jianfeng Gao, and Li Deng. 2016. End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In Proc of INTERSPEECH.
 Dahlbäck et al. (1993) Nils Dahlbäck, Arne Jönsson, and Lars Ahrenberg. 1993. Wizard of Oz studies: why and how. In Proc of Intelligent user interfaces.
 Daubigney et al. (2014) Lucie Daubigney, Matthieu Geist, Senthilkumar Chandramohan, and Olivier Pietquin. 2014. A comprehensive reinforcement learning framework for dialogue management optimisation. volume 6.

 El Asri et al. (2014) Layla El Asri, Romain Laroche, and Olivier Pietquin. 2014. Task completion transfer learning for reward inference. In Proc of MLIS.
 Engel (2005) Yaakov Engel. 2005. Algorithms and representations for reinforcement learning. PhD Thesis.
 Fatemi et al. (2016) Mehdi Fatemi, Layla El Asri, Hannes Schulz, Jing He, and Kaheer Suleman. 2016. Policy networks with twostage training for dialogue systems. In Proc of SigDial.
 Gašić et al. (2013) Milica Gašić, Catherine Breslin, Matt Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis, and Steve Young. 2013. Pomdpbased dialogue manager adaptation to extended domains. In Sigdial.
 Gašić et al. (2011) Milica Gašić, Filip Jurcicek, Blaise Thomson, Kai Yu, and Steve Young. 2011. Online policy optimisation of spoken dialogue systems via live interaction with human subjects. In IEEE ASRU.
 Gašić and Young (2014) Milica Gašić and Steve Young. 2014. Gaussian processes for pomdpbased dialogue manager optimization. IEEE, volume 22, pages 28–40.
 Henderson et al. (2008) James Henderson, Oliver Lemon, and Kallirroi Georgila. 2008. Hybrid reinforcement/supervised learning of dialogue policies from fixed data sets. In Computational Linguistics. MIT Press, volume 34, pages 487–511.

 Henderson et al. (2014) M. Henderson, B. Thomson, and S. J. Young. 2014. Wordbased Dialog State Tracking with Recurrent Neural Networks. In Proc of SIGdial.
 Hester et al. (2017) Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel DulacArnold, Ian Osband, John Agapiou, Joel Z. Leibo, and Audrunas Gruslys. 2017. Learning from demonstrations for real world reinforcement learning. In arXiv:1704.03732.
 Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Controllable text generation. In Proc of ICML.
 Jurčíček et al. (2011) Filip Jurčíček, Blaise Thomson, and Steve Young. 2011. Natural actor and belief critic: Reinforcement algorithm for learning parameters of dialogue systems modelled as pomdps. In ACM TSLP. ACM, volume 7, page 6.
 Kelley (1984) John F. Kelley. 1984. An iterative design methodology for userfriendly natural language office information applications.
 Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In arXiv preprint arXiv:1412.6980.
 Lemon et al. (2006) Oliver Lemon, Kallirroi Georgila, and James Henderson. 2006. Evaluating effectiveness and portability of reinforcement learned dialogue strategies with real users: the talk towninfo evaluation. In SLT. pages 178–181.
 Levin and Pieraccini (1997) Esther Levin and Roberto Pieraccini. 1997. A stochastic model of computerhuman interaction for learning dialogue strategies. In Eurospeech.

 Levin et al. (1998) Esther Levin, Roberto Pieraccini, and Wieland Eckert. 1998. Using markov decision process for learning dialogue strategies. In ICASSP.
 Li et al. (2016) Jiwei Li, Will Monroe, Alan Ritter, and Dan Jurafsky. 2016. Deep reinforcement learning for dialogue generation. In Proc of EMNLP.
 Li et al. (2014) Lihong Li, He He, and Jason D Williams. 2014. Temporal supervised learning for inferring a dialog policy from example conversations. In Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, pages 312–317.
 Lin (1992) LongJi Lin. 1992. Selfimproving reactive agents based on reinforcement learning, planning and teaching. In Machine learning. volume 8, pages 293–321.
 Martens (2014) James Martens. 2014. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193.
 Meuleau et al. (2000) Nicolas Meuleau, Leonid Peshkin, Leslie P Kaelbling, and KeeEung Kim. 2000. Offpolicy policy search. In Technical report, MIT AI Lab.
 Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Humanlevel control through deep reinforcement learning. In Nature. Nature Publishing Group, volume 518, pages 529–533.
 Mrkšić et al. (2017) Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, TsungHsien Wen, and Steve Young. 2017. Neural Belief Tracker: Datadriven dialogue state tracking. In Proc of ACL.
 Perez and Liu (2016) Julien Perez and Fei Liu. 2016. Dialog state tracking, a machine reading approach using memory network. arXiv preprint arXiv:1606.04052.
 Peters and Schaal (2006) Jan Peters and Stefan Schaal. 2006. Policy gradient methods for robotics. In IEEE RSJ.
 Precup et al. (2001) Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. 2001. Offpolicy temporaldifference learning with function approximation. In Proc of ICML.
 Rieser and Lemon (2006) Verena Rieser and Oliver Lemon. 2006. Using logistic regression to initialise reinforcementlearningbased dialogue systems. In Spoken Language Technology Workshop, 2006. IEEE. IEEE, pages 190–193.
 Rieser and Lemon (2009) Verena Rieser and Oliver Lemon. 2009. Natural language generation as planning under uncertainty for spoken dialogue systems. In Proc of EACL. pages 683–691.
 Roy et al. (2000) Nicholas Roy, Joelle Pineau, and Sebastian Thrun. 2000. Spoken dialogue management using probabilistic reasoning. In Proc of SigDial.

 Schatzmann et al. (2006) Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. 2006. A survey of statistical user simulation techniques for reinforcementlearning of dialogue management strategies. In The knowledge engineering review. Cambridge Univ Press, volume 21, pages 97–126.
 Schulman et al. (2015) John Schulman, Sergey Levine, Philipp Moritz, Michael I Jordan, and Pieter Abbeel. 2015. Trust region policy optimization. In Proc of ICML.
 Schulman et al. (2016) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2016. Highdimensional continuous control using generalized advantage estimation. In Proc of ICLR.
 Serban et al. (2015) Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015. Hierarchical neural network generative models for movie dialogues. In arXiv preprint arXiv:1507.04808.
 Shang et al. (2015) Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for shorttext conversation. In arXiv preprint arXiv:1503.02364.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of go with deep neural networks and tree search. In Nature. Nature Publishing Group, volume 529, pages 484–489.
 Su et al. (2016a) PeiHao Su, Milica Gašić, Nikola Mrkšić, Lina RojasBarahona, Stefan Ultes, David Vandyke, TsungHsien Wen, and Steve Young. 2016a. Continuously learning neural dialogue management. In arXiv preprint arXiv:1606.02689.
 Su et al. (2016b) PeiHao Su, Milica Gašić, Nikola Mrkšić, Lina RojasBarahona, Stefan Ultes, David Vandyke, TsungHsien Wen, and Steve Young. 2016b. Online active reward learning for policy optimisation in spoken dialogue systems. In Proc of ACL.
 Su et al. (2015a) PeiHao Su, David Vandyke, Milica Gašić, Dongho Kim, Nikola Mrkšić, TsungHsien Wen, and Steve Young. 2015a. Learning from real users: Rating dialogue success with neural networks for reinforcement learning in spoken dialogue systems. In Proc of Interspeech.
 Su et al. (2015b) PeiHao Su, David Vandyke, Milica Gašić, Nikola Mrkšić, TsungHsien Wen, and Steve Young. 2015b. Reward shaping with recurrent neural networks for speeding up online policy learning in spoken dialogue systems. In Proc of SigDial.
 Sutton and Barto (1999) Richard S. Sutton and Andrew G. Barto. 1999. Reinforcement Learning: An Introduction. MIT Press.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. 2000. Policy gradient methods for reinforcement learning with function approximation. In Proc of NIPS.
 Ultes et al. (2017) Stefan Ultes, Lina M. Rojas Barahona, PeiHao Su, David Vandyke, Dongho Kim, Inigo Casanueva, Pawel Budzianowski, Nikola Mrkšić, TsungHsien Wen, Milica Gašić, and Steve Young. 2017. CUPyDial: A Multidomain Statistical Dialogue System Toolkit. In ACL Demo.
 Van Hasselt et al. (2016) Hado Van Hasselt, Arthur Guez, and David Silver. 2016. Deep reinforcement learning with double qlearning. In Proc of AAAI.

 Vandyke et al. (2015) David Vandyke, PeiHao Su, Milica Gašić, Nikola Mrkšić, TsungHsien Wen, and Steve Young. 2015. Multidomain dialogue success classifiers for policy training. In IEEE ASRU.
 Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In arXiv preprint arXiv:1506.05869.
 Wang et al. (2017) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. 2017. Sample efficient actorcritic with experience replay. In Proc of ICLR.
 Wen et al. (2017) TsungHsien Wen, Milica Gašić, Nikola Mrkšić, Lina M. RojasBarahona, PeiHao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A networkbased endtoend trainable taskoriented dialogue system. In Proc of EACL.
 Wen et al. (2015) TsungHsien Wen, Milica Gašić, Nikola Mrkšić, PeiHao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned lstmbased natural language generation for spoken dialogue systems. In EMNLP.
 Williams et al. (2017) Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: practical and efficient endtoend dialog control with supervised and reinforcement learning. In ACL.
 Williams and Young (2007) Jason D. Williams and Steve Young. 2007. Partially observable Markov decision processes for spoken dialog systems. volume 21, pages 393–422.
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. In Machine learning. Springer, volume 8, pages 229–256.
 Yang et al. (2017) Xuesong Yang, YunNung Chen, Dilek HakkaniTür, Paul Crook, Xiujun Li, Jianfeng Gao, and Li Deng. 2017. Endtoend joint learning of natural language understanding and dialogue manager. In IEEE ICASSP. pages 5690–5694.
 Young et al. (2013) Steve Young, Milica Gašić, Blaise Thomson, and Jason Williams. 2013. Pomdpbased statistical spoken dialogue systems: a review. In Proc of IEEE. volume 99, pages 1–20.