1 Introduction
Deep reinforcement learning (RL) has recently shown impressive successes across a variety of tasks (Mnih et al., 2013; Tesauro, 1995; Silver et al., 2017, 2016; Brown & Sandholm, 2017; Watter et al., 2015; Lillicrap et al., 2015; Schulman et al., 2015; Levine et al., 2016). However, many of these successes have relied on enormous amounts of training data and have resulted in policies which do not generalize to novel tasks or to changes in the environment. Fewer successes have been achieved outside the realm of simulated environments or environments where agents can play against themselves. Thus, one major impediment towards applying deep RL to real-world problems is a lack of data-efficiency.
One promising solution is model-based RL, where an internal model of the environment is learned. By learning an internal environment model, the agent may be able to exploit structural properties of the environment. This enables the agent to reduce the amount of trial-and-error learning and to better generalize across states and actions.
In this paper we propose a model-based RL method based on learning an approximate, factorized transition model. The approximate transition model involves discrete, abstract states acting as information bottlenecks, which mediate the transitions between successive full states. Once learned, the approximate transition model is then applied to learn the agent’s policy (for example, using Q-learning with rollout simulations). This method has several advantages. First, the factorized model has significantly fewer parameters compared to a non-factorized transition model, making it highly sample efficient. Second, by learning the abstract state representation with the specific goal of obtaining an optimal policy (as opposed to maximizing the transition model’s predictive accuracy), it may be possible to trade off some of the transition model’s predictive power for an improvement in the policy’s performance. By grouping similar states together into the same discrete, abstract state, it may be possible to improve the performance of the policy learned with the approximate transition model.
The idea of grouping similar states together has been proposed before in a variety of forms, e.g. state aggregation (Bean et al., 1987; Bertsekas & Castanon, 1989; Dietterich, 2000; Jong & Stone, 2005; Jiang et al., 2015). In contrast to many previous approaches, in our method the grouping is applied exclusively within the approximate transition model, while the agent’s policy still operates on the complete world state. This allows the agent’s policy (e.g. a neural network) to form its own high-level, distributed representations of the world from the complete world state. Importantly, in this method, the agent’s policy is capable of obtaining better performance compared to standard state aggregation, because it may counter deficiencies in the abstract state representation by optimizing for myopic (next-step) rewards, which it can do efficiently by accessing the complete world state. This is particularly advantageous when it is possible to pretrain the policy to imitate a myopic human policy (e.g. by imitating single actions or preferences given by humans) or with a policy learned on a similar task. Furthermore, as with state aggregation, the grouping may incorporate prior structural knowledge. As shown by the experiments, significant performance improvements are obtained by leveraging simple knowledge of the problem domain.
Our contributions are threefold. First, we propose a model-based RL method, called the Bottleneck Simulator, which learns an approximate transition distribution with discrete abstract states acting as information bottlenecks. We formally define the Bottleneck Simulator and its corresponding Markov decision process (MDP) and describe the training algorithm in detail. Second, we provide a mathematical analysis based on fixed points. We provide two upper bounds on the error incurred when learning a policy with an approximate transition distribution: one for general approximate transition distributions and one for the Bottleneck Simulator. In particular, the second bound illustrates how the overall error may be attributed to four distinct sources: an error related to the abstract space structure (structural discrepancy), an error related to the transition model estimation variance, an error related to the transition model estimation bias, and an error related to the transition model class bias. Third, we demonstrate the data-efficiency of the Bottleneck Simulator on two tasks involving few data examples: a text adventure game and a real-world, complex dialogue response selection task. We demonstrate how efficient abstractions may be constructed and show that the Bottleneck Simulator beats competing methods on both tasks. In addition, we investigate the learned policies qualitatively and, for the text adventure game, measure how performance changes as a function of the learned abstraction structure.
2 Background
2.1 Definitions
A Markov decision process (MDP) is a tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is the set of states, $A$ is the set of actions, $P$ is the state transition probability function, $R(s, a) \in [0, r_{\max}]$ is the reward function, with $r_{\max} > 0$, and $\gamma \in (0, 1)$ is the discount factor (Sutton & Barto, 1998). We adopt the standard MDP formulation with finite horizon. At time $t$, the agent is in a state $s_t \in S$, takes an action $a_t \in A$, receives a reward $r_t = R(s_t, a_t)$ and transitions to a new state $s_{t+1} \in S$ with probability $P(s_{t+1} | s_t, a_t)$. We assume the agent follows a stochastic policy $\pi$. Given a state $s \in S$, the policy $\pi$ assigns a probability to each possible action $a \in A$: $\pi(a | s) \in [0, 1]$. The agent’s goal is to learn a policy maximizing the discounted sum of rewards $\sum_{t=1}^{T} \gamma^{t-1} r_t$ over the horizon $T$, called the cumulative return or, more briefly, the return.
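Given a policy $\pi$, the state-value function $V^{\pi}$ is defined as the expected return of the policy starting in state $s \in S$:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \;\middle|\; s_1 = s \right] \qquad (1)$$
The state-action-value function $Q^{\pi}$ is the expected return of taking action $a \in A$ in state $s \in S$, and then following policy $\pi$:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \;\middle|\; s_1 = s, a_1 = a \right] \qquad (2)$$
An optimal policy $\pi^*$ is a policy satisfying, $\forall s \in S$:
$$V^{\pi^*}(s) = \max_{\pi} V^{\pi}(s) \qquad (3)$$
The optimal policy can be found via dynamic programming using the Bellman optimality equations (Bertsekas & Tsitsiklis, 1995; Sutton & Barto, 1998), $\forall s \in S, a \in A$:
$$V^{\pi^*}(s) = \max_{a \in A} Q^{\pi^*}(s, a), \qquad Q^{\pi^*}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s' | s, a) V^{\pi^*}(s') \qquad (4)$$
which hold if and only if eq. (3) is satisfied. Popular algorithms for finding an optimal policy include Q-learning, SARSA and REINFORCE (Sutton & Barto, 1998).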
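To make the dynamic-programming view concrete, the following is a minimal sketch of Q-value iteration on a tabular MDP. It assumes an infinite-horizon discounted setting (so that a single stationary fixed point exists) and a hypothetical two-state, two-action toy problem; the function name `q_value_iteration` and the data layout are illustrative choices, not part of the method described in this paper.

```python
def q_value_iteration(P, R, gamma, n_iters=500):
    """Repeatedly apply the Bellman optimality backup on a tabular MDP.

    P[s][a] maps next-state index -> probability; R[s][a] is the reward.
    (Illustrative layout; an assumption of this sketch.)
    """
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        # V(s) = max_a Q(s, a)
        V = [max(row) for row in Q]
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = [[R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
              for a in range(n_actions)]
             for s in range(n_states)]
    return Q


# Hypothetical toy MDP: action 0 moves to state 0, action 1 moves to state 1.
# Only taking action 1 in state 1 is rewarded.
P = [[{0: 1.0}, {1: 1.0}], [{0: 1.0}, {1: 1.0}]]
R = [[0.0, 0.0], [0.0, 1.0]]
Q = q_value_iteration(P, R, gamma=0.9)
greedy_policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
```

In this toy problem the greedy policy with respect to the converged values always chooses action 1, since staying in state 1 collects the reward at every step.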
2.2 Model-based RL with Approximate Transition Models
Suppose we aim to learn an efficient policy for the MDP $\langle S, A, P, R, \gamma \rangle$, but without having access to the transition distribution $P$. However, suppose that we still have access to the set of states and actions, the discount factor $\gamma$ and the reward function $R(s, a)$ for each state-action pair $(s, a)$. This is a plausible setting for many real-world applications, including natural language processing, health care and robotics.
Suppose that a dataset with tuples has been collected with a policy acting in the true (ground) MDP (Sutton, 1990; Moore & Atkeson, 1993; Peng & Williams, 1993). (Since the reward function is assumed known and deterministic, the dataset does not need to contain the rewards.) We can use the dataset to estimate an approximate transition distribution :
Given , we can form an approximate MDP and learn a policy satisfying the Bellman equations in the approximate MDP, :
(5)  
in the hope that implies for policy .
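The most common approach is to learn $P_{\text{Approx}}$ by counting co-occurrences in the dataset (Moore & Atkeson, 1993):
$$P_{\text{Approx}}(s' | s, a) = \frac{\text{Count}(s, a, s')}{\text{Count}(s, a, \cdot)} \qquad (6)$$
where $\text{Count}(s, a, s')$ is the observation count for $(s, a, s')$ and $\text{Count}(s, a, \cdot)$ is the observation count for $(s, a)$ followed by any other state. Unfortunately, this approximation is not sample efficient, because its sample complexity for accurately estimating the transition probabilities may grow in the order of $O(|S|^2 |A|)$ (see appendix).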
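The counting estimator of eq. (6) can be sketched in a few lines. The function name `count_based_model`, the string-valued states and actions, and the toy dataset are assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def count_based_model(dataset):
    """Estimate P(s' | s, a) by normalizing co-occurrence counts.

    `dataset` is an iterable of (s, a, s_next) tuples (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for s, a, s_next in dataset:
        counts[(s, a)][s_next] += 1
    return {
        sa: {s_next: c / sum(ctr.values()) for s_next, c in ctr.items()}
        for sa, ctr in counts.items()
    }


# Toy dataset: ("s0", "a") is followed by "s1" twice and by "s2" once.
data = [("s0", "a", "s1"), ("s0", "a", "s1"), ("s0", "a", "s2"), ("s1", "b", "s0")]
P_hat = count_based_model(data)
```

Note that any transition that never appears in the dataset receives probability zero, which is one way to see why this estimator needs many samples when the state space is large.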
The next section presents the Bottleneck Simulator, which learns a more sample-efficient model and implements an inductive bias by utilizing information bottlenecks.
3 Bottleneck Simulator
3.1 Definition
We are now ready to define the Bottleneck Simulator, which is given by the tuple $\langle Z, S, A, P_{\text{Abs}} \rangle$, where $Z$ is a discrete set of abstract states, $S$ is the set of (full) states, $A$ is the set of actions and $P_{\text{Abs}}$ is a set of distributions. (In the POMDP literature, $Z$ often represents the observation. However, in our notation, $Z$ represents the discrete, abstract state.) Further, we in general assume that $|Z| \ll |S|$.
The Bottleneck Simulator is illustrated in Figure 1. Conditioned on an abstract state $z_t \in Z$, a state $s_t \in S$ is sampled. Conditioned on a state $s_t$ and an action $a_t \in A$, a reward $r_t$ is output. Finally, conditioned on a state $s_t$ and an action $a_t$, the next abstract state $z_{t+1} \in Z$ is sampled. Formally, the following distributions are defined:
Initial distribution of $z$: $P_{\text{Abs}}(z_0)$ (7)
Transition distribution of $z$: $P_{\text{Abs}}(z_{t+1} | s_t, a_t)$ (8)
Conditional distribution of $s$: $P_{\text{Abs}}(s_t | z_t)$ (9)
When viewed as a Markov chain, the abstract state $z$ is a Markovian state: given a sequence $(z_1, s_1, a_1, \dots, z_t)$, all future variables depend only on $z_t$. As such, the abstract state acts as an information bottleneck, since it has a much lower cardinality than the full states (i.e. $|Z| \ll |S|$). This bottleneck helps reduce sparsity and improve generalization. Furthermore, the representation for $z$ can be learned using unsupervised learning or supervised learning on another task. It may also incorporate domain-specific knowledge.
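Further, assume that for each state $s \in S$ there exists exactly one abstract state $z \in Z$ where $P_{\text{Abs}}(s | z)$ assigns non-zero probability. Formally, let $f_{s \to z}$ be a known surjective function mapping from $S$ to $Z$, such that:
$$P_{\text{Abs}}(s | z) = 0 \quad \text{if } f_{s \to z}(s) \neq z, \qquad \forall s \in S, z \in Z \qquad (10)$$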
This assumption allows us to construct a simple estimator for the transition distribution based on . Given a dataset of tuples collected under a policy acting in the true MDP, we may estimate as:
This approximation has a sample complexity in the order of $O(|S| |Z| |A|)$ (see appendix). This should be compared to the estimator discussed previously, based on counting co-occurrences, which had a sample complexity of $O(|S|^2 |A|)$. For $|Z| \ll |S|$, clearly $|S| |Z| |A| \ll |S|^2 |A|$. As such, this estimator is likely to achieve lower-variance transition probability estimates for problems with large state spaces.
However, the lower variance comes at the cost of an increased bias. By partitioning the states into groups, the abstract states must contain all salient information required to estimate the true transition probabilities:
If the abstract states cannot sufficiently capture this information, then the approximation w.r.t. the true transition distribution will be poor. This in turn is likely to cause the policy learned in the Bottleneck Simulator to yield poor performance. The same drawback applies to common state aggregation methods (Bean et al., 1987; Bertsekas & Castanon, 1989). However, unlike state aggregation methods, the policy learned with the Bottleneck Simulator has access to the complete world state. Finally, it should be noted that the count-based model for the transition distribution is still rather naïve and inefficient. In the next subsection, we propose a more efficient method for learning it.
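The factorization through abstract states can be illustrated with a small counting sketch. It assumes that the estimator multiplies two normalized counts, one for the next abstract state given the state-action pair and one for the next state given its abstract state; this is one plausible reading of the count-based model described above, not a verbatim reproduction of it. The names `bottleneck_count_model` and `f`, and the integer-valued toy states, are hypothetical.

```python
from collections import Counter, defaultdict


def bottleneck_count_model(dataset, f):
    """Factorized count estimate: P(s'|s,a) ~ P(z'|s,a) * P(s'|z'), z' = f(s').

    `dataset` holds (s, a, s_next) tuples; `f` maps a state to its abstract state.
    """
    z_counts = defaultdict(Counter)  # (s, a) -> counts over next abstract states
    s_counts = defaultdict(Counter)  # z'     -> counts over next full states
    for s, a, s_next in dataset:
        z_next = f(s_next)
        z_counts[(s, a)][z_next] += 1
        s_counts[z_next][s_next] += 1

    def prob(s_next, s, a):
        z_next = f(s_next)
        n_sa = sum(z_counts[(s, a)].values())
        n_z = sum(s_counts[z_next].values())
        if n_sa == 0 or n_z == 0:
            return 0.0  # no evidence for this state-action pair or abstract state
        return (z_counts[(s, a)][z_next] / n_sa) * (s_counts[z_next][s_next] / n_z)

    return prob


# Toy example: states 0..3, abstract state z = s // 2.
data = [(0, "a", 2), (0, "a", 3), (1, "a", 2), (1, "a", 0)]
prob = bottleneck_count_model(data, f=lambda s: s // 2)
p_unseen = prob(3, 1, "a")  # the transition (1, "a", 3) never occurs in `data`
```

Here `p_unseen` is non-zero even though the transition was never observed, because state 3 shares an abstract state with state 2. This is the generalization effect of the bottleneck, and also the source of its bias when states grouped together behave differently.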
3.2 Learning
We assume that $f_{s \to z}$ is known. The transition distributions can be learned using a parametric classification model (e.g. a neural network with softmax output) by optimizing its parameters w.r.t. log-likelihood. Denote by $P_{\text{Abs}}^{\theta}$ the transition distribution of the Bottleneck Simulator parametrized by a vector of parameters $\theta$. Formally, we aim to optimize:
This breaks the learning problem down into two optimization problems, which are solved separately. In the appendix we propose a method for learning and jointly.
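As a rough illustration of fitting a softmax classifier by maximum log-likelihood, the sketch below trains a linear softmax model to predict the next abstract state from a feature vector of the current state and action. It is a deliberately small stand-in: the text mentions neural networks, whereas this uses a single linear layer, full-batch gradient descent and hypothetical one-hot features. All names (`fit`, `predict`, `mean_nll`) are assumptions of this sketch.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, x):
    """Class probabilities of a linear softmax model with weight rows W[k]."""
    return softmax([sum(w_j * x_j for w_j, x_j in zip(row, x)) for row in W])


def mean_nll(W, X, y):
    """Average negative log-likelihood of labels y under the model."""
    return -sum(math.log(predict(W, x)[z]) for x, z in zip(X, y)) / len(X)


def fit(X, y, n_classes, lr=0.5, n_steps=200):
    """Full-batch gradient descent on the mean negative log-likelihood."""
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(n_steps):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, z in zip(X, y):
            p = predict(W, x)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == z else 0.0)
                for j, x_j in enumerate(x):
                    grad[k][j] += err * x_j / len(X)
        W = [[w - lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad)]
    return W


# Hypothetical features for (state, action) pairs and next-abstract-state labels.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
y = [0, 1, 1, 0]
W0 = [[0.0] * 3 for _ in range(2)]
W = fit(X, y, n_classes=2)
predicted = [max(range(2), key=lambda k: predict(W, x)[k]) for x in X]
```

Minimizing the mean negative log-likelihood is the same as minimizing cross-entropy against the observed next abstract states, which is the training criterion named in the text.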
4 Mathematical Analysis
In this section we develop two upper bounds related to the estimation error of the state-action-value function learned in an approximate MDP. The first bound pertains to a general class of approximate MDPs and illustrates the relationship between the learned state-action-value function and the accuracy of the approximate transition distribution. The second bound relies on the hierarchical latent structure and applies specifically to the Bottleneck Simulator. This bound illustrates how the Bottleneck Simulator may learn a better policy by trading off between four separate factors.
Define the true MDP as a tuple , where is the set of states, is the set of actions, is the true state transition distribution, is the true reward function and is the discount factor.
Let the tuple be an approximate MDP, where is the transition function. All other variables are the same as given in the true MDP. Let satisfy eq. (5). This approximate MDP will serve as a reference for comparison.
Let the tuple be the Bottleneck Simulator, where is the discrete set of abstract states and is the transition function, as defined in the previous section. All other variables are the same as given in the true MDP. Finally, let be the optimal stateactionvalue function w.r.t. the Bottleneck Simulator:
We derive bounds on the loss defined in terms of distance between and suboptimal fixed points and :
where $\| \cdot \|_{\infty}$ is the infinity norm (max norm). In other words, we bound the maximum absolute difference between the estimated return for any tuple $(s, a)$ between the approximate state-action value and the state-action value of the optimal policy. The same loss criterion was proposed by Bertsekas & Tsitsiklis (1995, Chapter 6) as well as others.
Our first theorem bounds the loss w.r.t. an approximate MDP using either the total variation distance or the Kullback-Leibler divergence (KL-divergence). This theorem follows as a simple extension of existing results in the literature (Ross, 2013; Talvitie, 2015).
Theorem 1.
Let be the optimal state-action-value function w.r.t. the approximate MDP , and let be the optimal state-action-value function w.r.t. the true MDP . Let be their contraction rates. Then it holds that:
(11)  
(12)  
(13) 
where is the conditional KL-divergence between and .
Proof.
See appendix. ∎
Eqs. (12) and (13) provide general bounds for any approximate transition distribution (including ). The bounds are asymptotically tight in the sense that when converges to both bounds go to zero. Finally, the looser bound in eq. (13) motivates why the approximate transition distribution might be learned using cross-entropy loss (or, equivalently, maximum log-likelihood).
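The relationship between the two distance measures can be checked numerically. The sketch below computes the total variation distance and the KL-divergence between two hypothetical next-state distributions and verifies Pinsker's inequality, $TV(p, q) \le \sqrt{KL(p \| q) / 2}$. That the looser bound is obtained through this inequality is an assumption of the sketch; the derivation itself is in the appendix.

```python
import math


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


# Hypothetical true and approximate next-state distributions for one (s, a).
p_true = [0.7, 0.2, 0.1]
p_approx = [0.5, 0.3, 0.2]

tv = total_variation(p_true, p_approx)
kl = kl_divergence(p_true, p_approx)
pinsker_bound = math.sqrt(kl / 2.0)
```

Since lowering the KL-divergence also lowers this upper limit on the total variation distance, a model trained by maximum log-likelihood is indirectly pushed towards a small total variation distance as well.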
Our second theorem bounds the loss specifically w.r.t. the Bottleneck Simulator, under the condition that if two states belong to the same abstract state (i.e. ) then their state-value functions are close to each other w.r.t. the optimal policy : for some if . This state-value similarity is closely related to metrics based on the notion of bisimulation (Dean et al., 1997; Ferns et al., 2004, 2012; Abel et al., 2016). The theorem is related to the results obtained by Ross (2013, p. 257) and Ross & Bagnell (2012), though their assumptions are different, which in turn yields a bound in terms of expectations.
Theorem 2.
Let be the optimal state-action-value function w.r.t. the Bottleneck Simulator , and let be the optimal state-action-value function w.r.t. the true MDP . Let be their contraction rates, and define:
(14) 
Then it holds that:
(15)  
where and are defined as:
(16)  
(17)  
(18)  
(19) 
is the state visitation distribution under policy , and satisfies:
(20) 
Proof.
See appendix.∎
The bound in eq. (15) consists of four error terms, each with an important interpretation:
Structural Discrepancy: The structural discrepancy is defined in eq. (14) and measures the discrepancy (or dissimilarity) between state values within each partition. By assigning states with similar expected returns to the same abstract partitions, the discrepancy is decreased. Further, by increasing the number of abstract states (for example, by breaking large partitions into smaller ones), the discrepancy is also decreased. The discrepancy depends only on and , which makes it independent of any collected dataset . Unlike the previous bound in eq. (12), the discrepancy remains constant for and . However, in practice, as more data is accumulated it is of course desirable to enlarge with new states. In particular, if is grown large enough such that each state belongs to its own abstract state (e.g. ) then it can be shown that this term equals zero.
Transition Model Estimation Variance: This error term is a variant of the total variation distance between and , where each term is weighted by the minimum state-value function within each abstract state . The distribution represents the most accurate learned under the policy under the constraint that the model factorizes as . In other words, corresponds to estimated on an infinite dataset collected under the policy . As such, this error term is analogous to the variance term in the bias-variance decomposition for supervised learning problems. Furthermore, suppose that . In this case, the last two error terms in the bound are exactly zero, and this error term is smaller than the general bound in eq. (12) since applying yields:
For problems with large state spaces or with extreme state values, we might expect for the majority of states . In this case, this bound would be far smaller than the general bound given in eq. (12). Finally, we may observe the sampling complexity of this error term. Under the simple counting distribution introduced earlier, has parameters. This suggests only samples are required to reach a certain accuracy. In contrast, for the general with a counting distribution, the sampling complexity grows with . For , we would expect this term to decrease on the order of times faster than the term given in eq. (12).
Transition Model Estimation Bias: This error term measures the weighted total variation distance between and , where each term is weighted by . In other words, it measures the distance between the most accurate approximate transition distribution , obtainable from an infinite dataset collected under policy , and the optimal factorized transition distribution (i.e. the transition distribution with the minimum sum of weighted absolute differences to the true transition distribution). As such, this error term represents the systematic bias induced by the behaviour policy .
Transition Model Class Bias: This error term measures the weighted total variation distance between and , where each term is weighted by . It represents the systematic bias induced by the restricted class of probability distributions, which factorize according to latent abstract states with the mapping . As such, this error term is analogous to the bias term in the bias-variance decomposition for supervised learning problems. As more data is accumulated it is possible to enlarge with new states. In particular, if is grown large enough, such that each state belongs to its own abstract state, and if is a tabular function, then this error term will become zero.
The bound in eq. (15) offers more than a theoretical analysis. In the extreme case where , the bound inspires hope that we may yet learn an effective policy if only we can learn an abstraction with small structural discrepancy.
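To give some intuition for the structural discrepancy, the sketch below measures how far apart the state values of two states sharing an abstract state can be. It assumes the discrepancy can be summarized as the largest within-partition gap in state value, which matches the closeness condition stated before Theorem 2 but is not necessarily the exact form of eq. (14). The state values and the three partitions are hypothetical.

```python
from collections import defaultdict


def max_within_partition_gap(values, f):
    """Largest difference in state value between two states that share an
    abstract state under the mapping f (an illustrative proxy only)."""
    groups = defaultdict(list)
    for state, value in values.items():
        groups[f(state)].append(value)
    return max(max(vs) - min(vs) for vs in groups.values())


# Hypothetical optimal state values for four states.
values = {0: 1.0, 1: 1.5, 2: 4.0, 3: 4.2}

gap_one_cluster = max_within_partition_gap(values, lambda s: 0)         # all states merged
gap_two_clusters = max_within_partition_gap(values, lambda s: s // 2)   # {0, 1} and {2, 3}
gap_identity = max_within_partition_gap(values, lambda s: s)            # one state per cluster
```

The three partitions reproduce the qualitative behaviour described for this error term: splitting a large partition into smaller ones with similar values reduces the gap, and it vanishes when every state is its own abstract state.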
5 Experiments
We carry out experiments on two natural language processing tasks in order to evaluate the performance of the Bottleneck Simulator and to compare it to other approaches. Many real-world natural language processing tasks involve complex, stochastic structures, which have to be modelled accurately in order to learn an effective policy. Here, large amounts of training data (e.g. training signals for on-policy learning, or observed trajectories of human agents executing similar tasks) are often not available. This makes these tasks particularly suitable for demonstrating the advantages of the Bottleneck Simulator related to data-efficiency, including improved performance based on few samples.
5.1 Text Adventure Game
The first task is the text adventure game Home World introduced by Narasimhan et al. (2015). The environment consists of four connected rooms: kitchen, bedroom, garden and living room. The game’s objective involves executing a given task in a specific room, such as eating an apple in the kitchen when the task objective is "You are hungry". The agent receives a reward of once the task is completed. Further, we adopt the more complex setting where the objectives are redundant and confusing, such as "You are not hungry but now you are sleepy". The vocabulary size is 84. The environment has 192 unique states and 22 actions.
Setup: We use the same experimental setup and hyperparameters as Narasimhan et al. (2015) for our baseline. We train a state-action-value function baseline policy parametrized as a feed-forward neural network with Q-learning. The baseline policy is trained until the average percentage of games completed reaches 15%. Then, we estimate the Bottleneck Simulator environment model with the episodes collected thus far (1500 transitions). On the collected episodes, we learn the mapping from states to abstract states by applying $k$-means clustering using Euclidean distance to the word embeddings computed on the objective text and current observation text. We use GloVe word embeddings (Pennington et al., 2014). We train the transition model on the collected episodes. The transition model is a two-layer MLP classifier predicting a probability for each cluster id of the next state given a state and action. Finally, we train a two-layer MLP predicting the reward given a state and action. This MLP defines the reward function in the Bottleneck Simulator environment model.
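The clustering step can be sketched as follows. This is a minimal k-means implementation with Euclidean distance, assuming each state has already been turned into a fixed-size embedding vector. The two-dimensional toy embeddings stand in for the word-embedding features, and the deterministic initialization from the first `k` points is a simplification chosen so the example is reproducible; none of these details are taken from the experimental setup.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` in squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centroids[c])))


def kmeans(points, k, n_iters=20):
    """Plain k-means; centroids are initialized from the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(n_iters):
        assignments = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Hypothetical 2-D state embeddings forming two well-separated groups.
embeddings = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.2), (5.1, 4.9), (0.2, 0.1), (4.8, 5.2)]
centroids = kmeans(embeddings, k=2)


def state_to_abstract(embedding):
    """Map a state embedding to its cluster id, i.e. its abstract state."""
    return nearest(embedding, centroids)


abstract_states = [state_to_abstract(e) for e in embeddings]
```

Once the centroids are fixed, the mapping from states to abstract states is deterministic, which is what the transition model needs in order to use cluster ids as classification targets.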
Policy: We initialize the Bottleneck Simulator policy from the baseline policy and continue training it by rolling out simulations in the Bottleneck Simulator environment model. After every 150 rollouts in the Bottleneck Simulator environment model, we evaluate the policy in the real game by letting the agent play out 20 episodes and measuring the percentage of completed games. We stop training when the percentage of completed games stops improving.
Benchmark Policies: We compare the Bottleneck Simulator policy to two benchmark policies. The first is the baseline policy trained with Q-learning. The second is a policy trained with a state abstraction method, which we call State Abstraction. The observed states are mapped to abstract states , where are the same set of abstract states utilized by the Bottleneck Simulator environment model. As with the Bottleneck Simulator environment model, the function is used to map from states to abstract states. The action space was not modified. As with the Bottleneck Simulator policy, we evaluate the State Abstraction policy every 150 episodes by letting the agent play out another 20 episodes and measuring the percentage of completed games. For the final evaluation, we select the State Abstraction policy which obtained the highest percentage of completed games.
Evaluation: The results are given in Table 1, averaged over 10 runs for different cluster sizes. We evaluate the policies based on the percentage of games completed. (It should be noted that the State Abstraction policy diverged on average two out of ten times in the experiment. None of the other policies appeared to have diverged.) It is important to note that since our goal is to evaluate sample efficiency, our policies have been trained on far fewer episodes compared to Narasimhan et al. (2015). (Indeed, a tabular state-action-value function could straightforwardly be trained with Q-learning to solve this task perfectly if given enough training examples.) Further, we have retained the baseline policy’s hyperparameters for the Bottleneck Simulator policies while reporting the results. We observe peak performance at for the Bottleneck Simulator policy with a cluster size of , which is significantly higher than the State Abstraction policy at and the baseline policy at only . This shows empirically that the Bottleneck Simulator policy is the most sample efficient algorithm, since all policies have been trained on the same number of examples. Finally, it should be noted that the State Abstraction and Bottleneck Simulator policies are complementary and could potentially be combined (e.g. by training a State Abstraction policy from samples generated by the Bottleneck Simulator environment model).
Number of Clusters (i.e. $|Z|$)
Policy  4  16  24  32
Q-learning (Baseline)
State Abstraction
Bottleneck Simulator
95% confidence intervals). Q-learning baseline policy was trained once until reaching 15% game completion on average.
5.2 Dialogue
The second task is a real-world problem, where the agent must select appropriate responses in social, chit-chat conversations. The task is the 2017 Amazon Alexa Prize Competition (Ram et al., 2017), where a spoken dialogue system must converse coherently and engagingly with humans on popular topics (e.g. entertainment, fashion, politics, sports) (see also https://developer.amazon.com/alexaprize/2017alexaprize).
Setup: We experiment with a dialogue system consisting of an ensemble of 22 response models. The response models take as input a dialogue and output responses in natural language text. In addition, the response models may also output one or several scalar values, indicating confidence levels. The response models each have their own internal procedure for generating responses: some response models are based on information retrieval models, others on generative language models, and yet others on template-based procedures. Taken together, these response models output a diverse set of responses. The dialogue system is described further in Serban et al. (2017a) (see also Serban et al. (2017b)).
The agent’s task is to select an appropriate response from the set of responses, in order to maximize the satisfaction of the human user. At the end of each dialogue, the user gives a score between 1 (low satisfaction) and 5 (high satisfaction).
Prior to this experiment, a few thousand dialogues were recorded between users and two other agents acting with greedy exploration. These dialogues are used for training the Bottleneck Simulator and the benchmark policies. In addition, about 200,000 labels were annotated at the dialogue-turn level using crowdsourcing: for each recorded dialogue, an annotator was shown a dialogue and several system responses (the actual response selected by the agent as well as alternative responses) and asked to score each between 1 (very poor) and 5 (excellent).
Policy: The Bottleneck Simulator policy is trained using discounted Q-learning on rollout simulations from the Bottleneck Simulator environment model. The policy is parametrized as a state-action-value function, taking as input the dialogue history and a candidate response. Based on the dialogue history and candidate response, 1458 features are computed, including word embeddings, dialogue acts, part-of-speech tags, unigram and bigram word overlap, and model-specific features. These features are given as input to a five-layered feed-forward neural network, which then outputs the estimated state-action value. Further details on the model architecture are given in the appendix.
Abstraction Space: As defined earlier, let $Z$ be the set of abstract states used by the Bottleneck Simulator environment model. We then define $Z$ as the Cartesian product:
where , and are three discrete sets. The first set consists of dialogue acts, representing the high-level intention of the user’s utterance (Stolcke et al., 2000): . The second set consists of sentiment types: . The third set contains the binary variable: . This variable is True only when the user utterance is generic and topic-independent (i.e. when the user utterance only contains stop-words). We develop a deterministic classifier mapping dialogue histories to corresponding classes in , and . Although we only consider dialogue acts, sentiment and generic utterances, it is trivial to expand the abstract state with other types of information.
Transition Model: The Bottleneck Simulator environment model uses a transition distribution parametrized by three independent two-layer MLP models. All three MLP models take as input the same features as the Bottleneck Simulator policy, as well as features related to the dialogue act, sentiment and generic property of the last user utterance. The first MLP predicts the next dialogue act (), the second MLP predicts the next sentiment type () and the third MLP predicts whether the next user utterance is generic (). The training dataset consists of recorded dialogue transitions, of which of the dialogues are used as training set and of the dialogues are used as validation set. The MLPs are trained with cross-entropy using mini-batch stochastic gradient descent. During rollout simulations, given a dialogue history and an action selected by the policy, the next abstract state is sampled according to the predicted probability distributions of the three MLP models. Then, a corresponding next dialogue history is sampled uniformly at random from the set of recorded dialogues, under the constraint that the dialogue history matches the abstract state (i.e. ).
Reward Model: The Bottleneck Simulator environment model uses a reward model parametrized as a feed-forward neural network with a softmax output layer. The reward model is trained to estimate the reward for each action based on the 200,000 crowdsourced labels. When rolling out simulations with the Bottleneck Simulator, the expected reward is given to the agent at each time step. Unless otherwise stated, in the remainder of this section, this is the model we will refer to as the learned, approximate reward model.
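The rollout step described in the Transition Model paragraph can be sketched as below. The sketch assumes the three predicted distributions are available as dictionaries (standing in for the outputs of the three MLPs), that an abstract state is a tuple of (dialogue act, sentiment, generic flag), and that recorded dialogue histories are indexed by abstract state. The label strings, the history identifiers and the fallback of returning `None` when no recorded history matches are all hypothetical choices of this example.

```python
import random


def sample_next_state(factor_probs, states_by_z, rng):
    """Sample the next abstract state factor by factor, then draw a recorded
    state uniformly at random among those matching that abstract state."""
    z_next = tuple(
        rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        for probs in factor_probs
    )
    candidates = states_by_z.get(z_next, [])
    if not candidates:
        return z_next, None  # no recorded history matches (assumed fallback)
    return z_next, rng.choice(candidates)


rng = random.Random(0)

# Stand-ins for the three predicted distributions (degenerate for clarity).
factor_probs = [
    {"question": 1.0, "statement": 0.0},  # next dialogue act
    {"positive": 1.0, "negative": 0.0},   # next sentiment type
    {False: 1.0, True: 0.0},              # whether the next utterance is generic
]

# Hypothetical index from abstract state to recorded dialogue histories.
states_by_z = {("question", "positive", False): ["history_17", "history_42"]}

z_next, s_next = sample_next_state(factor_probs, states_by_z, rng)
```

Sampling each factor independently mirrors the use of three independent models; how to handle abstract states with no matching recorded history is not specified in the text and would need a deliberate choice in practice.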
Benchmark Policies: We compare the Bottleneck Simulator policy to seven competing methods:
Heuristic: a heuristic policy based on predefined rules.
Supervised: a state-action-value function policy trained with supervised learning (cross-entropy) to predict the annotated scores on the 200,000 crowdsourced labels.
Q-learning: a state-action-value function policy trained with discounted Q-learning on the recorded dialogues, where episode returns are given by a learned, approximate reward model.
Q-Function Approx.: a state-action-value function policy trained on the 500,000 recorded transitions with a least-squares regression loss, where the target values are given by a learned, approximate reward model.
REINFORCE: an off-policy REINFORCE policy trained with reward shaping on the 500,000 recorded transitions, where episode returns are given by the final user scores.
REINFORCE Critic: an off-policy REINFORCE policy trained with reward shaping on the 500,000 recorded transitions, where episode returns are given by a learned, approximate reward model.
State Abstraction: a tabular state-action-value function policy trained with discounted Q-learning on rollouts from the Bottleneck Simulator environment model, with abstract policy state space containing discrete abstract states and action space containing abstract actions, and where episode returns are given by a learned, approximate reward model.
The two off-policy REINFORCE policies were trained with the action probabilities of the recorded dialogues (information which none of the other policies used).
With the exception of the Heuristic and State Abstraction policies, all policies were parametrized as five-layered feed-forward neural networks. Furthermore, the Bottleneck Simulator, the Q-learning, the Q-Function Approx. and the two off-policy REINFORCE policies were all initialized from the Supervised policy. This is analogous to pretraining the policies to imitate a myopic human policy (i.e. imitating the immediate actions of humans in given states). For these policies, the first few hidden layers were kept fixed after initialization from the Supervised policy, due to the large number of parameters. See appendix for details.
Policy  Crowdsourced: Human Score  Simulated Rollouts: Return  Simulated Rollouts: Avg Reward
Heuristic
Supervised
Q-learning
Q-Function Approx.
REINFORCE
REINFORCE Critic
State Abstraction
Bottleneck Sim.

Policy  Exp 1  Exp 2  Exp 3
Heuristic
Supervised
Q-Function Approx.
REINFORCE
REINFORCE Critic
Bottleneck Sim.  *

Policy  System NPs  Word Overlap: This Turn  Word Overlap: Next Turn
Heuristic
Supervised
Q-Function Approx.
REINFORCE
REINFORCE Critic
Bottleneck Sim.
Preliminary Evaluation: We use two methods to perform a preliminary evaluation of the learned policies.
The first method evaluates each policy using the crowdsourced human scores. For each dialogue history, the policy must select one of the corresponding annotated responses. Afterwards, the policy receives the human-annotated score as reward. Finally, we compute the average human-annotated score of each policy. This evaluation serves as a useful approximation of the immediate, average reward a policy would get on the set of annotated dialogues.^6
^6 The feed-forward neural network policies were all pretrained with cross-entropy to predict the training set of the crowdsourced labels, such that their second last layer computes the probability of each human score (see appendix for details). Therefore, the output of their last hidden layer is used to select the response in the crowdsourced evaluation. Note further that the crowdsourced evaluation is carried out on the held-out test set of crowdsourced labels, while the neural network parameters were trained on the training set of crowdsourced labels.
The second method evaluates each policy by running 500 rollout simulations in the Bottleneck Simulator environment model, and computes the average return and average reward per time step. The rollouts are carried out on the held-out validation set of dialogue transitions (i.e. only states that occur in the held-out validation set are sampled during rollouts). Although the Bottleneck Simulator environment model is far from an accurate representation of the real world, it has been trained with cross-entropy (maximum log-likelihood) on 500,000 recorded transitions. Therefore, the rollout simulations might serve as a useful first approximation of how a policy might perform when interacting with real-world users. The exception to this interpretation is the Bottleneck Simulator and State Abstraction policies, which themselves utilized rollout simulations from the Bottleneck Simulator environment model during training. Because of this, it is possible that these two policies are overfitting the Bottleneck Simulator environment model and, in turn, that this evaluation overestimates their performance. Therefore, we will not consider strong performance of either of these two policies here as indicating that they are superior to other policies.
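The rollout-based evaluation can be sketched as follows. This is only an illustration under stated assumptions: `evaluate_policy`, the `step` interface of the simulator, the toy environment, and the use of an undiscounted return are hypothetical choices, not details confirmed by the paper.

```python
import random


def evaluate_policy(policy, step, start_states, n_rollouts=500,
                    max_steps=20, seed=0):
    """Return (average return per rollout, average reward per time step)."""
    rng = random.Random(seed)
    total_reward, total_steps = 0.0, 0
    for _ in range(n_rollouts):
        # start states are drawn from a held-out set of states
        state = rng.choice(start_states)
        for _ in range(max_steps):
            action = policy(state)
            state, reward, done = step(state, action, rng)
            total_reward += reward
            total_steps += 1
            if done:
                break
    return total_reward / n_rollouts, total_reward / total_steps


# Hypothetical toy simulator: reward 1 when the action matches the state's
# parity; the next state is sampled uniformly and episodes never end early.
def toy_env_step(state, action, rng):
    reward = 1.0 if action == state % 2 else 0.0
    return rng.randrange(4), reward, False


def parity_policy(state):
    return state % 2


avg_return, avg_reward = evaluate_policy(
    parity_policy, toy_env_step, start_states=[0, 1, 2, 3], max_steps=5)
```

Reporting both quantities is useful because the average return depends on how long the simulated episodes run, while the average reward per time step does not.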
The results are given in Table 2. On the crowdsourced evaluation, the Supervised policy and all policies initialized from it perform decently in terms of average human score. This is to be expected, since the Supervised policy is trained only to maximize the crowdsourced human scores. However, the Heuristic policy performs significantly worse, indicating that there is much improvement to be made on top of the predefined rules. Further, the State Abstraction policy performs worst out of all the policies, indicating that the abstract state-action space cannot effectively capture important aspects of the states and actions to learn a useful policy for this complex task. On the rollout simulation evaluation, we observe that the Supervised policy, Q-learning policy, and Bottleneck Simulator policy are tied for first place. Since the Bottleneck Simulator policy performs similarly to the other policies here, it would appear that the policy has not overfitted the Bottleneck Simulator environment model. After these policies follow the two REINFORCE policies and the Heuristic policy. Second last comes the State Abstraction policy, which again indicates that the state abstraction method is insufficient for this complex task. Finally, the Q-Function Approx. policy appears to perform the worst, suggesting that the learned, approximate reward model it was trained with does not perform well.
This section provided a preliminary evaluation of the policies. The next section will provide a large-scale, real-world user evaluation.
Real-World Evaluation: We carry out a large-scale evaluation with real-world users through three A/B testing experiments conducted during the Amazon Alexa Prize Competition, between July 29th and August 21st, 2017. In the first experiment the Heuristic, Supervised, Q-Function Approx., REINFORCE, REINFORCE Critic and Bottleneck Simulator policies were evaluated. In the next two experiments only the Bottleneck Simulator and REINFORCE policies were evaluated. In total, 3000 user scores were collected.
The average user scores are given in Table 3. We observe that the Bottleneck Simulator policy performed best in both the first and third experiments. This suggests that the Bottleneck Simulator policy has learned an effective policy, which is in agreement with the preliminary evaluation. On the other hand, the REINFORCE policy performed best in the second experiment. This suggests that the REINFORCE policy is the strongest competitor to the Bottleneck Simulator policy. In line with the preliminary evaluation, the REINFORCE Critic and Q-Function Approx. policies perform worse than the REINFORCE and Bottleneck Simulator policies. Finally, in contrast to the preliminary evaluation, the Supervised policy performs worse than all other policies evaluated, though not significantly worse than the Heuristic policy.
Next, we conduct an analysis of the policies in the first experiment w.r.t. topical specificity and topical coherence. For topical specificity, we measure the average number of noun phrases per system utterance. A topic-specific policy will score high on this metric. For topical coherence, we measure the word overlap between the user’s utterance and the system’s response, as well as the word overlap between the user’s utterance and the system’s response at the next turn. The more a policy remains on topic, the higher we would expect these two metrics to be. A good policy should have both high topical specificity and high topical coherence.
As shown in Table 4, the Bottleneck Simulator policy performed best on all three metrics. This suggests that the Bottleneck Simulator policy produces the most coherent and engaging dialogues out of all the evaluated policies. This is in agreement with its excellent performance w.r.t. real-world user scores and w.r.t. the preliminary evaluation. A further analysis of the selected responses indicates that the Bottleneck Simulator policy has learned a more risk-tolerant strategy.
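The two word-overlap coherence metrics can be illustrated with a short sketch. The exact definition of word overlap is not spelled out in this section, so the version below (the number of distinct lower-cased tokens shared by two utterances, averaged over turns) is an assumption, as are the helper names `word_overlap` and `coherence_scores`. Counting noun phrases would additionally require a syntactic parser and is left out.

```python
import re


def tokenize(text):
    """Lower-case and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_overlap(a, b):
    """Number of distinct tokens shared by two utterances (assumed metric)."""
    return len(set(tokenize(a)) & set(tokenize(b)))


def coherence_scores(dialogue):
    """dialogue: list of (user_utterance, system_response) turns.

    Returns (average overlap with the response at the same turn,
             average overlap with the response at the next turn).
    """
    this_turn = [word_overlap(u, s) for u, s in dialogue]
    next_turn = [word_overlap(dialogue[t][0], dialogue[t + 1][1])
                 for t in range(len(dialogue) - 1)]
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(this_turn), avg(next_turn)


example = [
    ("i like jazz music", "jazz music is great"),
    ("tell me more", "more jazz coming up"),
]
same_turn_overlap, next_turn_overlap = coherence_scores(example)
```

In the example, the system stays on the topic of jazz across both turns, so both scores are non-zero.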
6 Related Work
Model-based RL: Model-based RL research dates back to the 90s, and includes well-known algorithms such as Dyna, R-max and $E^3$ (Sutton, 1990; Moore & Atkeson, 1993; Peng & Williams, 1993; Kuvayev & Sutton, 1997; Brafman & Tennenholtz, 2002; Kearns & Singh, 2002; Wiering & Schmidhuber, 1998; Wiering et al., 1999). Model-based RL with deep learning has also been investigated, in particular for robotic control tasks (Watter et al., 2015; Lenz et al., 2015; Gu et al., 2016; Finn & Levine, 2017). Sample-efficient approaches have also been proposed by taking a Bayesian approach to learning the dynamics model. For example, PILCO incorporates uncertainty by learning a distribution over models of the dynamics in conjunction with the agent’s policy (Deisenroth & Rasmussen, 2011). Another approach based on Bayesian optimization was proposed by Bansal et al. (2017). An approach combining dynamics models of various levels of fidelity or accuracy was proposed by Cutler et al. (2015). Other related work includes Oh et al. (2015), Venkatraman et al. (2016), Kansky et al. (2017) and Racanière et al. (2017).
The idea of grouping similar states together also has a long history in the RL community. Numerous algorithms exist for models based on state abstraction (state aggregation) (Bean et al., 1987; Bertsekas & Castanon, 1989; Dean et al., 1997; Dietterich, 2000; Jong & Stone, 2005; Li et al., 2006; Jiang et al., 2015). The main idea of state abstraction is to group together similar states and solve the reduced MDP. Solving the optimization problem in the reduced MDP requires far fewer iterations or samples, which improves convergence speed and sample efficiency. In particular, related theoretical analyses of the regret incurred by state abstraction methods are provided by Van Roy (2006) and Petrik & Subramanian (2014). In contrast to state abstraction, in the Bottleneck Simulator the grouping is applied exclusively within the approximate transition model while the agent’s policy operates on the complete, observed state. Compared to state abstraction, the Bottleneck Simulator reduces the impact of compounding errors caused by inaccurate abstractions in the approximate transition model. By giving the policy access to the complete, observed state, it may counter inaccurate abstractions by optimizing for myopic (next-step) rewards. This enables pretraining the policy to mimic a myopically optimal policy (e.g. single human actions), as is the case in the dialogue response selection task. Furthermore, the Bottleneck Simulator allows a deep neural network policy to learn its own high-level, distributed representations of the state from scratch. Finally, the Bottleneck Simulator enables a mathematical analysis of the trade-offs incurred by the learned transition model in terms of structural discrepancy and weighted variation distances, which is not possible in the case of general, approximate transition models.
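As a generic illustration of classical state aggregation (grouping ground states and solving the reduced MDP), consider the sketch below. It does not reproduce any of the cited algorithms: the uniform weighting of ground states within a group, the function names `aggregate_mdp` and `value_iteration`, and the four-state toy MDP are all assumptions made for the example.

```python
def aggregate_mdp(P, R, phi, n_abstract):
    """Build a reduced MDP from a ground MDP and an abstraction map.

    P[s][a][s2]: ground transition probabilities, R[s][a]: ground rewards,
    phi[s]: index of the abstract state that ground state s belongs to.
    Ground states within a group are weighted uniformly (an assumption);
    every abstract state is assumed to have at least one member.
    """
    n_actions = len(P[0])
    groups = [[s for s in range(len(P)) if phi[s] == z]
              for z in range(n_abstract)]
    P_abs = [[[0.0] * n_abstract for _ in range(n_actions)]
             for _ in range(n_abstract)]
    R_abs = [[0.0] * n_actions for _ in range(n_abstract)]
    for z, group in enumerate(groups):
        w = 1.0 / len(group)
        for s in group:
            for a in range(n_actions):
                R_abs[z][a] += w * R[s][a]
                for s2, p in enumerate(P[s][a]):
                    P_abs[z][a][phi[s2]] += w * p
    return P_abs, R_abs


def value_iteration(P, R, gamma=0.9, iters=500):
    """Plain value iteration; returns the state-value list."""
    n, m = len(P), len(P[0])
    V = [0.0] * n
    for _ in range(iters):
        V = [max(R[s][a] + gamma * sum(p * V[s2]
                                       for s2, p in enumerate(P[s][a]))
                 for a in range(m))
             for s in range(n)]
    return V


# Toy ground MDP: states {0, 1} and {2, 3} behave identically in pairs.
# Action 0 stays (reward 1 in states 2 and 3), action 1 switches group.
P_ground = [
    [[1, 0, 0, 0], [0, 0, 1, 0]],
    [[0, 1, 0, 0], [0, 0, 0, 1]],
    [[0, 0, 1, 0], [1, 0, 0, 0]],
    [[0, 0, 0, 1], [0, 1, 0, 0]],
]
R_ground = [[0, 0], [0, 0], [1, 0], [1, 0]]
phi = [0, 0, 1, 1]

P_abs, R_abs = aggregate_mdp(P_ground, R_ground, phi, n_abstract=2)
V_abs = value_iteration(P_abs, R_abs)
```

Here the abstraction is exact, so solving the two-state reduced MDP loses nothing. With an inexact grouping, the reduced MDP would only approximate the ground MDP, which is the source of the compounding errors discussed above.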
In a related vein, learning a factorized MDP (for example, by factorizing the state transition model) has also been investigated extensively in the literature (Boutilier et al., 1999; Degris et al., 2006; Strehl et al., 2007; Ross & Pineau, 2008; Bellemare et al., 2013; Wu et al., 2014; Osband & Van Roy, 2014; Hallak et al., 2015). For example, Ross & Pineau (2008) develop an efficient Bayesian framework for learning factorized MDPs. As another example, Bellemare et al. (2013) propose a Bayesian framework for learning a factored environment model based on a class of recursively decomposable factorizations. An important line of work in this area is stochastic factorization models (Barreto et al., 2015, 2016). Similar to non-negative matrix factorization (NMF), these models approximate the environment transition model with a product of two stochastic matrices. Similar to other methods, these models may improve sample efficiency when the intrinsic dimensionality of the transition model is low. However, in comparison to the Bottleneck Simulator and other methods, it is difficult to incorporate domain-specific knowledge since both factor matrices are learned from scratch. In contrast to the Bottleneck Simulator and other state abstraction methods, there is no constraint for each state to belong to exactly one abstract state. Whether this constraint improves or degrades performance is task-specific. However, without imposing this constraint, it seems unlikely that one can provide a mathematical analysis of policy performance in terms of structural discrepancy.
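The structure of a stochastic factorization can be illustrated with a small sketch. The symbols `D` and `K`, the helper names, and the numbers below are assumptions made for the example and are not taken from the cited papers' code. The sketch only shows that the product of two row-stochastic matrices is again a valid transition matrix, and how soft rows in `D` differ from a hard assignment of each state to exactly one abstract state.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def is_row_stochastic(M, tol=1e-9):
    """Check that all entries are non-negative and each row sums to 1."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) < tol for row in M)


# D maps each of 4 states to a distribution over 2 latent states.
# Rows 0 and 2 are one-hot (hard assignment); rows 1 and 3 are soft.
D = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0],
     [0.2, 0.8]]

# K maps each latent state to a distribution over the 4 next states.
K = [[0.7, 0.1, 0.1, 0.1],
     [0.0, 0.25, 0.25, 0.5]]

# Approximate 4 x 4 transition matrix (for a single, fixed action).
P_approx = matmul(D, K)
```

If every row of `D` were one-hot, each state would belong to exactly one latent state, which corresponds to the constraint imposed by the Bottleneck Simulator and other state abstraction methods.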
Dialogue Systems: Numerous researchers have applied RL for training goal-oriented dialogue systems (Singh et al., 1999; Williams & Young, 2007; Pieraccini et al., 2009). One line of research has focused on learning dialogue systems through simulations using abstract dialogue states and actions (Eckert et al., 1997; Levin et al., 2000; Chung, 2004; Cuayáhuitl et al., 2005; Georgila et al., 2006; Schatzmann et al., 2007; Heeman, 2009; Traum et al., 2008; Lee & Eskenazi, 2012; Khouzaimi et al., 2017; López-Cózar, 2016; Su et al., 2016; Fatemi et al., 2016; Asri et al., 2016). The approaches here differ based on how the simulator is created or estimated, and whether or not the simulator is also considered an agent trying to optimize its own reward. For example, Levin et al. (2000) tackle the problem of building a flight booking dialogue system. They estimate a user simulator model by counting transition probabilities between abstract dialogue states and user actions (similar to an n-gram model), which is then used to train an RL policy. As a more recent example, Yu et al. (2016) propose to learn a dialogue manager policy through model-free off-policy RL based on simulations with a rule-based system. Researchers have also investigated learning generative neural network policies operating directly on raw text through user simulations (Zhao & Eskenazi, 2016; Guo et al., 2016; Das et al., 2017; Lewis et al., 2017; Liu & Lane, 2017). In parallel to our work, Peng et al. (2018) have proposed a related model-based reinforcement learning approach for dialogue utilizing the Dyna algorithm. To the best of our knowledge, the Bottleneck Simulator is the first model-based RL approach with discrete, abstract states to be applied to learning a dialogue policy operating on raw text.
7 Conclusion
We have proposed the Bottleneck Simulator, a model-based reinforcement learning (RL) approach combining a learned, factorized environment transition model with rollout simulations to learn an effective policy from few data examples. The learned transition model employs an abstract, discrete state (a bottleneck state), which increases sample efficiency by reducing the number of model parameters and by exploiting structural properties of the environment. We have provided a mathematical analysis of the Bottleneck Simulator in terms of fixed points of the learned policy. The analysis reveals how the policy’s performance is affected by four distinct sources of error, related to the abstract space structure (structural discrepancy), to the transition model estimation variance, to the transition model estimation bias, and to the transition model class bias. We have evaluated the Bottleneck Simulator on two natural language processing tasks: a text adventure game and a real-world, complex dialogue response selection task. On both tasks, the Bottleneck Simulator has shown excellent performance, outperforming competing approaches. In contrast to much of the previous work on abstraction in RL, our dialogue experiments are based on a complex, real-world task with a very high-dimensional state space and are evaluated by real-world users.
References
 Abel et al. (2016) Abel, D., Hershkowitz, D. E., and Littman, M. L. Near optimal behavior via approximate state abstraction. In ICML, 2016.
 Asri et al. (2016) Asri, L. E., He, J., and Suleman, K. A sequence-to-sequence model for user simulation in spoken dialogue systems. In InterSpeech, 2016.
 Bansal et al. (2017) Bansal, S., Calandra, R., Xiao, T., Levine, S., and Tomlin, C. J. Goal-Driven Dynamics Learning via Bayesian Optimization. arXiv preprint arXiv:1703.09260, 2017.
 Barreto et al. (2015) Barreto, A. d. M. S., Beirigo, R. L., Pineau, J., and Precup, D. An expectation-maximization algorithm to compute a stochastic factorization from data. In IJCAI, pp. 3329–3336, 2015.
 Barreto et al. (2016) Barreto, A. d. M. S., Beirigo, R. L., Pineau, J., and Precup, D. Incremental stochastic factorization for online reinforcement learning. In AAAI, pp. 1468–1475, 2016.
 Bean et al. (1987) Bean, J. C., Birge, J. R., and Smith, R. L. Aggregation in dynamic programming. Operations Research, 35(2):215–220, 1987.
 Bellemare et al. (2013) Bellemare, M., Veness, J., and Bowling, M. Bayesian learning of recursively factored environments. In ICML, pp. 1211–1219, 2013.
 Bertsekas & Castanon (1989) Bertsekas, D. P. and Castanon, D. A. Adaptive aggregation methods for infinite horizon dynamic programming. IEEE Transactions on Automatic Control, 34(6):589–598, 1989.
 Bertsekas & Tsitsiklis (1995) Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming: an overview. In IEEE Conference on Decision and Control, 1995, volume 1, pp. 560–564. IEEE, 1995.
 Boutilier et al. (1999) Boutilier, C., Dean, T., and Hanks, S. Decision-theoretic planning: Structural assumptions and computational leverage. JAIR, 11(1):94, 1999.
 Brafman & Tennenholtz (2002) Brafman, R. I. and Tennenholtz, M. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. JMLR, 3(Oct):213–231, 2002.
 Breiman (1996) Breiman, L. Bagging predictors. Machine learning, 24(2):123–140, 1996.
 Brown & Sandholm (2017) Brown, N. and Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, pp. eaao1733, 2017.
 Chung (2004) Chung, G. Developing a flexible spoken dialog system using simulation. In ACL, pp. 63, 2004.
 Cuayáhuitl et al. (2005) Cuayáhuitl, H., Renals, S., Lemon, O., and Shimodaira, H. Human-computer dialogue simulation using hidden Markov models. In IEEE Workshop on ASRU, pp. 290–295, 2005.
 Cutler et al. (2015) Cutler, M., Walsh, T. J., and How, J. P. Real-world reinforcement learning via multi-fidelity simulators. IEEE Transactions on Robotics, 31(3):655–671, 2015.
 Das et al. (2017) Das, A., Kottur, S., Moura, J. M., Lee, S., and Batra, D. Learning cooperative visual dialog agents with deep reinforcement learning. In International Conference on Computer Vision, 2017.
 Dean et al. (1997) Dean, T., Givan, R., and Leach, S. Model reduction techniques for computing approximately optimal solutions for Markov decision processes. In UAI, pp. 124–131, 1997.
 Degris et al. (2006) Degris, T., Sigaud, O., and Wuillemin, P.-H. Learning the structure of factored markov decision processes in reinforcement learning problems. In ICML, pp. 257–264. ACM, 2006.
 Deisenroth & Rasmussen (2011) Deisenroth, M. and Rasmussen, C. E. Pilco: A model-based and data-efficient approach to policy search. In ICML, pp. 465–472, 2011.
 Dietterich (2000) Dietterich, T. G. Hierarchical reinforcement learning with the MAXQ value function decomposition. JAIR, 13:227–303, 2000.
 Eckert et al. (1997) Eckert, W., Levin, E., and Pieraccini, R. User modeling for spoken dialogue system evaluation. In IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 80–87. IEEE, 1997.
 Fatemi et al. (2016) Fatemi, M., Asri, L. E., Schulz, H., He, J., and Suleman, K. Policy networks with twostage training for dialogue systems. In SIGDIAL, 2016.
 Ferns et al. (2004) Ferns, N., Panangaden, P., and Precup, D. Metrics for finite Markov decision processes. In UAI, pp. 162–169, 2004.
 Ferns et al. (2012) Ferns, N., Castro, P. S., Precup, D., and Panangaden, P. Methods for computing state similarity in Markov decision processes. arXiv preprint arXiv:1206.6836, 2012.
 Finn & Levine (2017) Finn, C. and Levine, S. Deep visual foresight for planning robot motion. In Robotics and Automation (ICRA), pp. 2786–2793. IEEE, 2017.
 Georgila et al. (2006) Georgila, K., Henderson, J., and Lemon, O. User simulation for spoken dialogue systems: Learning and evaluation. In Conference on Spoken Language Processing, 2006.
 Gu et al. (2016) Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep q-learning with model-based acceleration. In ICML, pp. 2829–2838, 2016.
 Guo et al. (2016) Guo, X., Klinger, T., Rosenbaum, C., Bigus, J. P., et al. Learning to query, reason, and answer questions on ambiguous texts. In ICLR, 2016.
 Hallak et al. (2015) Hallak, A., Schnitzler, F., Mann, T., and Mannor, S. Off-policy model-based learning under unknown factored dynamics. In ICML, pp. 711–719, 2015.
 Heeman (2009) Heeman, P. A. Representing the reinforcement learning state in a negotiation dialogue. In IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 450–455. IEEE, 2009.
 Jiang et al. (2015) Jiang, N., Kulesza, A., and Singh, S. Abstraction selection in model-based reinforcement learning. In ICML, pp. 179–188, 2015.
 Jong & Stone (2005) Jong, N. K. and Stone, P. State abstraction discovery from irrelevant state variables. In IJCAI, volume 8, pp. 752–757, 2005.
 Kansky et al. (2017) Kansky, K., Silver, T., Mély, D. A., Eldawy, M., et al. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. arXiv preprint arXiv:1706.04317, 2017.
 Kearns & Singh (2002) Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
 Khouzaimi et al. (2017) Khouzaimi, H., Laroche, R., and Lefevre, F. Incremental human-machine dialogue simulation. In Dialogues with Social Robots, pp. 53–66. Springer, 2017.
 Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
 Kuvayev & Sutton (1997) Kuvayev, L. and Sutton, R. Approximation in model-based learning. In Proceedings of the ICML, volume 97, 1997.
 Lee & Eskenazi (2012) Lee, S. and Eskenazi, M. POMDP-based let’s go system for spoken dialog challenge. In Spoken Language Technology Workshop (SLT), 2012 IEEE, pp. 61–66. IEEE, 2012.
 Lenz et al. (2015) Lenz, I., Knepper, R. A., and Saxena, A. Deepmpc: Learning deep latent features for model predictive control. In Robotics: Science and Systems, 2015.
 Levin et al. (2000) Levin, E., Pieraccini, R., and Eckert, W. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on speech and audio processing, 8(1):11–23, 2000.
 Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
 Lewis et al. (2017) Lewis, M., Yarats, D., Dauphin, Y. N., Parikh, D., and Batra, D. Deal or No Deal? End-to-End Learning for Negotiation Dialogues. In EMNLP, 2017.
 Li et al. (2006) Li, L., Walsh, T. J., and Littman, M. L. Towards a unified theory of state abstraction for mdps. In ISAIM, 2006.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Liu & Lane (2017) Liu, B. and Lane, I. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In Proceedings of 2017 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Okinawa, Japan, 2017.
 Liu et al. (2016) Liu, C.-W., Lowe, R., Serban, I. V., Noseworthy, M., Charlin, L., and Pineau, J. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP, 2016.
 López-Cózar (2016) López-Cózar, R. Automatic creation of scenarios for evaluating spoken dialogue systems via user-simulation. Knowledge-Based Systems, 106:51–73, 2016.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Moore & Atkeson (1993) Moore, A. W. and Atkeson, C. G. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1):103–130, 1993.
 Narasimhan et al. (2015) Narasimhan, K., Kulkarni, T., and Barzilay, R. Language Understanding for Text-based Games Using Deep Reinforcement Learning. ArXiv e-prints, June 2015.
 Ng et al. (1999) Ng, A. Y., Harada, D., and Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287, 1999.
 Oh et al. (2015) Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. Action-conditional video prediction using deep networks in atari games. In NIPS, pp. 2863–2871, 2015.
 Osband & Van Roy (2014) Osband, I. and Van Roy, B. Near-optimal Reinforcement Learning in Factored MDPs. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), NIPS, pp. 604–612, 2014.
 Peng et al. (2018) Peng, B., Li, X., Gao, J., Liu, J., and Wong, K.-F. Integrating planning for task-completion dialogue policy learning. arXiv preprint arXiv:1801.06176, 2018.
 Peng & Williams (1993) Peng, J. and Williams, R. J. Efficient learning and planning within the dyna framework. Adaptive Behavior, 1(4):437–454, 1993.
 Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In EMNLP, 2014.
 Petrik & Subramanian (2014) Petrik, M. and Subramanian, D. Raam: The benefits of robustness in approximating aggregated mdps in reinforcement learning. In NIPS, pp. 1979–1987, 2014.
 Pieraccini et al. (2009) Pieraccini, R., Suendermann, D., Dayanidhi, K., and Liscombe, J. Are we there yet? research in commercial spoken dialog systems. In Text, Speech and Dialogue, pp. 3–13. Springer, 2009.
 Precup (2000) Precup, D. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, 2000.
 Precup et al. (2001) Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In ICML, 2001.
 Racanière et al. (2017) Racanière, S., Weber, T., Reichert, D., Buesing, L., et al. Imagination-augmented agents for deep reinforcement learning. In NIPS, pp. 5694–5705, 2017.
 Ram et al. (2017) Ram, A., Prasad, R., Khatri, C., Venkatesh, A., et al. Conversational AI: The Science Behind the Alexa Prize. In Alexa Prize Proceedings, 2017.
 Ross (2013) Ross, S. Interactive learning for sequential decisions and predictions. PhD thesis, Carnegie Mellon University, 2013.
 Ross & Bagnell (2012) Ross, S. and Bagnell, J. A. Agnostic system identification for model-based reinforcement learning. In ICML, 2012.
 Ross & Pineau (2008) Ross, S. and Pineau, J. Model-based bayesian reinforcement learning in large structured domains. In Conference on Uncertainty in Artificial Intelligence, volume 2008, pp. 476, 2008.
 Schatzmann et al. (2007) Schatzmann, J., Thomson, B., Weilhammer, K., Ye, H., and Young, S. Agenda-based user simulation for bootstrapping a pomdp dialogue system. In NAACL-HLT, pp. 149–152. Association for Computational Linguistics, 2007.
 Schulman et al. (2015) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
 Serban et al. (2017a) Serban, I. V., Sankar, C., Zhang, S., Lin, Z., Subramanian, S., Kim, T., Chandar, S., Ke, N. R., et al. The octopus approach to the alexa competition: A deep ensemble-based socialbot. In Alexa Prize Proceedings, 2017a.
 Serban et al. (2017b) Serban, I. V., Sankar, C., Germain, M., Zhang, S., Lin, Z., Subramanian, S., Kim, T., Pieper, M., Chandar, S., Ke, N. R., et al. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349, 2017b.
 Shawar & Atwell (2007) Shawar, B. A. and Atwell, E. Chatbots: are they really useful? In LDV Forum, volume 22, 2007.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Silver et al. (2017) Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
 Singh et al. (1999) Singh, S. P., Kearns, M. J., Litman, D. J., and Walker, M. A. Reinforcement learning for spoken dialogue systems. In NIPS, pp. 956–962, 1999.
 Stolcke et al. (2000) Stolcke, A., Ries, K., Coccaro, N., Shriberg, E., Bates, R., Jurafsky, D., Taylor, P., Martin, R., Van Ess-Dykema, C., and Meteer, M. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3), 2000.
 Strehl et al. (2007) Strehl, A. L., Diuk, C., and Littman, M. L. Efficient structure learning in factored-state mdps. In AAAI, volume 7, pp. 645–650, 2007.
 Su et al. (2016) Su, P.-H., Gasic, M., Mrksic, N., Rojas-Barahona, L., et al. Continuously learning neural dialogue management. arXiv preprint arXiv:1606.02689, 2016.
 Sutton (1990) Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, pp. 216–224, 1990.
 Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. Number 1 in 1. MIT Press Cambridge, 1998.
 Talvitie (2015) Talvitie, E. Agnostic system identification for monte carlo planning. In AAAI, pp. 2986–2992, 2015.
 Tesauro (1995) Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
 Traum et al. (2008) Traum, D., Marsella, S. C., Gratch, J., Lee, J., and Hartholt, A. Multi-party, multi-issue, multi-strategy negotiation for multi-modal virtual agents. In International Workshop on Intelligent Virtual Agents, pp. 117–130. Springer, 2008.
 Tsybakov (2009) Tsybakov, A. B. Introduction to nonparametric estimation. revised and extended from the 2004 french original. translated by vladimir zaiats, 2009.
 Van Roy (2006) Van Roy, B. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006.
 Venkatraman et al. (2016) Venkatraman, A., Capobianco, R., Pinto, L., Hebert, M., Nardi, D., and Bagnell, J. A. Improved learning of dynamics models for control. In International Symposium on Experimental Robotics, pp. 703–713. Springer, 2016.
 Wallace (2009) Wallace, R. S. The anatomy of alice. Parsing the Turing Test, 2009.
 Watter et al. (2015) Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In NIPS, pp. 2746–2754, 2015.
 Wiering & Schmidhuber (1998) Wiering, M. and Schmidhuber, J. Efficient model-based exploration. In Conference on Simulation of Adaptive Behavior, pp. 223–228, 1998.
 Wiering et al. (1999) Wiering, M., Sałustowicz, R., and Schmidhuber, J. Reinforcement learning soccer teams with incomplete world models. Autonomous Robots, 7(1):77–88, 1999.
 Williams & Young (2007) Williams, J. D. and Young, S. Partially observable markov decision processes for spoken dialog systems. Computer Speech & Language, 21(2):393–422, 2007.
 Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4), 1992.
 Wu et al. (2014) Wu, B., Feng, Y.P., and Zheng, H.Y. Model-based Bayesian Reinforcement Learning in Factored Markov Decision Process. JCP, 9(4):845–850, 2014.
 Yu et al. (2016) Yu, Z., Xu, Z., Black, A. W., and Rudnicky, A. I. Strategy and policy learning for non-task-oriented conversational systems. In SIGDIAL, 2016.
 Zhao & Eskenazi (2016) Zhao, T. and Eskenazi, M. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In SIGDIAL, 2016.
Appendix A Dynamic Programming Preliminaries
The Bellman optimality equations can be shortened by defining the Bellman operator (sometimes called the dynamic programming operator) (Bertsekas & Tsitsiklis, 1995, Chapter 2). For a given (not necessarily optimal) stateactionvalue function , the operator is:
(21) 
In other words, the operator updates towards with one dynamic programming iteration.
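For orientation, the Bellman optimality operator is commonly written as shown below. The notation is an assumption made here and may differ from the paper's own symbols: $Q$ denotes a state-action-value function, $T$ the operator, $R(s, a)$ the expected reward, $P(s' \mid s, a)$ the transition probability, and $\gamma$ the discount factor.

```latex
(TQ)(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q(s', a')
```

Under this form, the optimal state-action-value function $Q^*$ is the unique fixed point of the operator, $TQ^* = Q^*$, and $T$ is a contraction with rate $\gamma$ in the max-norm.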
We need the following lemma, as derived by Jiang et al. (2015).
Lemma 1.
Let and be the fixed points for the Bellman optimality operators , , which both operate on and have contraction rate :
(22) 
Proof.
We prove the inequality by writing out the left-hand side, applying the triangle inequality and the Bellman residual bound (Bertsekas & Tsitsiklis, 1995, Chapter 2):
We move the first term on the right-hand side to the other side of the inequality and reorder the terms:
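Written out in assumed notation, the standard form of this argument reads as follows. The symbols are assumptions and may not match the paper's: $Q_1$ and $Q_2$ are the fixed points of the operators $T_1$ and $T_2$, $\gamma$ is the contraction rate of $T_1$, and $\|\cdot\|_\infty$ is the max-norm.

```latex
\begin{align*}
\|Q_1 - Q_2\|_\infty
  &= \|T_1 Q_1 - T_2 Q_2\|_\infty \\
  &\le \|T_1 Q_1 - T_1 Q_2\|_\infty + \|T_1 Q_2 - T_2 Q_2\|_\infty \\
  &\le \gamma \, \|Q_1 - Q_2\|_\infty + \|T_1 Q_2 - Q_2\|_\infty ,
\end{align*}
% the last step uses the contraction property of T_1 and T_2 Q_2 = Q_2.
\begin{align*}
(1 - \gamma) \, \|Q_1 - Q_2\|_\infty &\le \|T_1 Q_2 - Q_2\|_\infty
\quad\Longrightarrow\quad
\|Q_1 - Q_2\|_\infty \le \frac{\|T_1 Q_2 - Q_2\|_\infty}{1 - \gamma} .
\end{align*}
```

The first inequality is the triangle inequality, and the second bounds the distance between the two fixed points by the Bellman residual of $Q_2$ under $T_1$.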