Planning with Expectation Models

by Yi Wan et al.
University of Alberta

Distribution and sample models are two popular model choices in model-based reinforcement learning (MBRL). However, learning these models can be intractable, particularly when the state and action spaces are large. Expectation models, on the other hand, are relatively easier to learn due to their compactness and have also been widely used for deterministic environments. For stochastic environments, it is not obvious how expectation models can be used for planning as they only partially characterize a distribution. In this paper, we propose a sound way of using approximate expectation models for MBRL. In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.




1 Introduction

Learning models of the world and effectively planning with them remains a long-standing challenge in artificial intelligence. Model-based reinforcement learning (MBRL) formalizes this problem in the reinforcement-learning setting, where the model captures the environment's transition dynamics and reward function. Once the model of the environment has been learned, an agent can potentially use it to arrive at plans without needing to interact with the environment.

The output of the model is one of the key choices in the design of a planning agent, as it determines the way the model is used for planning. Should the model produce 1) a distribution over the next state feature vector, 2) a sample of the next state feature vector, or 3) the expected next state feature vector? For stochastic environments, distribution and sample models can be used effectively, particularly if the distribution can be assumed to be of a special form [Deisenroth and Rasmussen2011, Chua et al.2018]. For arbitrarily stochastic environments, learning a sample or distribution model could be intractable or even impossible. For deterministic environments, expectation models appear to be the default choice, as they are easier to learn and have been widely used [Oh et al.2015, Leibfried et al.2016]. However, for general stochastic environments, it is not obvious how expectation models can be used for planning, as they only partially characterize a distribution. In this paper, we develop an approach to use expectation models for arbitrarily stochastic environments by restricting the value function to be linear in the state-feature vector.

Once the choice of expectation models with linear value function has been made, the next question is to develop an algorithm which uses the model for planning. In previous work, planning methods have been proposed which use expectation models for policy evaluation [Sutton et al.2012]. However, as we demonstrate empirically, the proposed methods require strong conditions on the model which might not hold in practice, causing the value function to diverge to infinity. Thus, a key challenge is to devise a sound planning algorithm which uses approximate expectation models for policy evaluation and has convergence guarantees. In this work, we propose a new objective function called the Model-Based Mean Square Projected Bellman Error (MB-MSPBE) for policy evaluation and show how it relates to the Mean Square Projected Bellman Error (MSPBE) [Sutton et al.2009]. We derive a planning algorithm which minimizes the proposed objective and show its convergence under conditions milder than those assumed in previous work [Sutton et al.2012].

It is important to note that in this work, we focus on the value prediction task for model-based reinforcement learning. Predicting the value of a policy is an integral component of Generalized Policy Iteration (GPI), on which many modern reinforcement learning control algorithms are built [Sutton and Barto2018]. Policy evaluation is also key for building predictive world knowledge, where questions about the world are formulated using value functions [Sutton et al.2011, Modayil et al.2014, White and others2015]. More recently, policy evaluation has also been shown to be useful for representation learning, where value functions are used as auxiliary tasks [Jaderberg et al.2016]. While model-based reinforcement learning for policy evaluation is interesting in its own right, the ideas developed in this paper can also be extended to the control setting.

2 Problem Setting

We formalize an agent's interaction with its environment by a finite Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, p, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $\mathcal{R}$ is a set of rewards, $p$ is the transition model such that $p(s', r \mid s, a) \doteq \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$ and $\sum_{s', r} p(s', r \mid s, a) = 1$, and $\gamma \in [0, 1)$ is the discount factor. A stationary policy $\pi(a \mid s)$ determines the behavior of the agent. The value function $v_\pi(s) \doteq \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid S_0 = s\big]$ describes the expected discounted sum of rewards obtained by following policy $\pi$. In this work, we assume the discounted bounded-reward problem setting, i.e., $\gamma \in [0, 1)$ and the rewards are bounded.

In practice, the agent does not have access to the states directly, but only through an $n$-dimensional real-valued feature vector $x(s)$, where $x : \mathcal{S} \to \mathbb{R}^n$ is the feature mapping, which can be an arbitrarily complex function for extracting the state-features. Tile-coding [Sutton1996] and the Fourier basis [Konidaris et al.2011] are examples of expert-designed state-feature mapping functions. An alternative is to learn the mapping using auxiliary tasks and approximate the value function using the learned state-features [Chung et al.2018, Jaderberg et al.2016]. In that case the value function is usually approximated using a parametrized function with an $m$-dimensional weight vector $w \in \mathbb{R}^m$, where typically $m \ll |\mathcal{S}|$. We write $\hat{v}(s, w)$ for the approximate value of state $s$. The approximate value function can either be a linear function of the state-features, $\hat{v}(s, w) \doteq w^\top x(s)$ with $m = n$, or a non-linear function $\hat{v}(s, w) \doteq f(x(s), w)$, where $f$ is an arbitrary function. Similarly, it is common to use the state-feature vector as the input of the policy and as both input and output of the approximate model, which we discuss in the next section.

The Dyna architecture [Sutton1991] is an MBRL algorithm which unifies learning, planning, and acting via updates to the value function. The agent interacts with the world, using observed state, action, next-state, and reward tuples to estimate the model and to update an estimate of the action-value function for policy $\pi$. The planning step in Dyna repeatedly samples possible next states and rewards from the model, given input state-action pairs. These hypothetical experiences can be used to update the action-value function, just as if they had been generated by interacting with the environment. The search-control procedure decides what states and actions are used to query the model during planning. The efficiency of planning can be significantly improved with non-uniform search control such as prioritized sweeping [Moore and Atkeson1993, Sutton et al.2012, Pan et al.2018]. In the function approximation setting, there are three factors that can affect the solution of a planning algorithm: 1) the distribution of data used to train the model, 2) the search-control process's distribution for selecting the starting feature vectors and actions for simulating the next feature vectors, and 3) the policy being evaluated.

Consider an agent wanting to evaluate a policy $\pi$, i.e., approximate $v_\pi$, using a Dyna-style planning algorithm. Assume that the data used to learn the model come from the agent's interaction with the environment using a behavior policy $b$. It is common to make an ergodicity assumption on the Markov chain induced by $b$:

Assumption 2.1

The Markov chain induced by policy $b$ is ergodic.

Under this assumption we can define expectations in terms of the unique stationary distribution induced by $b$.

Let $d_b$ denote $b$'s stationary state distribution, and let $\mathcal{S}_x$ denote the set of all states sharing feature vector $x$. Consequently, the stationary feature-vector distribution corresponding to $d_b$ is $\mu_b(x) \doteq \sum_{s \in \mathcal{S}_x} d_b(s)$. Suppose the search-control process generates a sequence of i.i.d. random vectors $X_1, X_2, \ldots$, where each $X_t$ follows a distribution $\mu$, and chooses actions according to policy $\pi$, i.e., $A_t \sim \pi(\cdot \mid X_t)$, which is the policy to be evaluated. The model is usually assumed to be bounded:

Assumption 2.2

The model's output is bounded.

Since we assume the finite MDP setting, the numbers of states, actions, and feature vectors are all finite, and the model output is therefore always bounded. For the uniqueness of the solution, it is also assumed that the feature vectors generated by the search-control process are linearly independent:

Assumption 2.3

$\mathbb{E}_{X \sim \mu}\big[X X^\top\big]$ is non-singular.

3 Approximate Models

In the function approximation setting, an approximate model of the transition dynamics and the reward function is used for planning. The approximate model may map a state-feature vector and an action to either a distribution over the next state-feature vectors and rewards (distribution model), a sample from the distribution over the next-state feature vectors and rewards (sample model), or to the expected next state-feature vector and reward (expectation model). For the sake of brevity, we refer to the approximate models as just models, from here onwards.

A distribution model takes a state-feature vector and action as input and produces a distribution over the next-state feature vectors and rewards. Distribution models are deterministic, as the input of the model completely determines the output. Distribution models have typically been used with special forms such as Gaussians [Chua et al.2018] or Gaussian processes [Deisenroth and Rasmussen2011]. In general, however, learning a distribution can be impractical, as distributions are potentially large objects. For example, if the state is represented by a feature vector of dimension $n$, then the first moment of its distribution is an $n$-vector, but the second moment is an $n \times n$ matrix, the third moment is an $n \times n \times n$ tensor, and so on.

Sample models are a more practical alternative, as they only need to generate a sample of the next-state feature vector and reward, given a state-feature vector and action. Sample models can use arbitrary distributions to generate the samples (even though they do not explicitly represent those distributions) and still produce objects of limited size, e.g., feature vectors of dimension $n$. They are particularly well suited for sample-based planning methods such as Monte Carlo Tree Search [Coulom2006]. Unlike distribution models, however, sample models are stochastic, which creates an additional branching factor in planning, as multiple samples need to be drawn to gain a representative sense of what might happen.

Expectation models are an even simpler approach, where the model produces the expectation of the next-state feature vector and reward. The advantages of expectation models are that the state output is compact (like a sample model) and deterministic (like a distribution model). The potential disadvantage of an expectation model is that it is only a partial characterization of the distribution. For example, if the result of an action (or option) is that two binary state features each occur with probability 0.5, but are never present (= 1) together, then an expectation model can capture the first part (each present with probability 0.5), but not the second (never both present). This may not be a substantive limitation, however, as we can always add a third binary state feature, for example, for the AND of the original two features, and then capture the full distribution with the expectation of all three state features.
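The two-binary-feature example can be made concrete in a few lines of Python; the two distributions below are hypothetical instances of the situation just described:

```python
def expected_features(dist):
    """Expectation of a binary feature vector under a distribution
    given as {feature_tuple: probability}."""
    n = len(next(iter(dist)))
    return tuple(sum(p * f[i] for f, p in dist.items()) for i in range(n))

# Two next-state feature distributions with the same expectation (0.5, 0.5):
exclusive = {(1, 0): 0.5, (0, 1): 0.5}   # each present half the time, never both
together  = {(1, 1): 0.5, (0, 0): 0.5}   # always present together

# Augmenting with a third feature, the AND of the first two, separates them:
def with_and(dist):
    return {f + (f[0] & f[1],): p for f, p in dist.items()}
```

Here `expected_features(exclusive)` and `expected_features(together)` coincide, while after augmentation they differ in the third component, so the expectation of all three features recovers the distinction.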

4 Expectation Models and Linearity

Expectation models can be less complex than distribution and sample models and, therefore, can be easier to learn. This is especially critical for model-based reinforcement learning where the agent is to learn a model of the world and use it for planning. In this work, we focus on answering the question: how can expectation models be used for planning in Dyna, despite the fact that they are only a partial characterization of the transition dynamics?

There is a surprisingly simple answer to this question: if the value function is linear in the state features, then there is no loss of generality when using an expectation model for planning. Consider the dynamic programming update for policy evaluation with a distribution model $\hat{p}$, for a given feature vector $x$ and action $a$:

$$w^\top x \leftarrow \sum_{x', r} \hat{p}(x', r \mid x, a)\,\big[\, r + \gamma\, w^\top x' \,\big] \qquad (1)$$

$$= \hat{r}(x, a) + \gamma\, w^\top \hat{x}(x, a), \qquad (2)$$

where $\hat{x}(x, a) \doteq \sum_{x', r} \hat{p}(x', r \mid x, a)\, x'$ and $\hat{r}(x, a) \doteq \sum_{x', r} \hat{p}(x', r \mid x, a)\, r$. The second equation uses the expectation model corresponding to the distribution model; the equality follows from the linearity of expectation. This result shows that no generality has been lost by using an expectation model, if the value function is linear. Further, the same equations also advocate the other direction: if we are using an expectation model, then the approximate value function should be linear. This is because (2) is unlikely to equal (1) for general distributions if $\hat{v}$ is non-linear in the state-features.

It is important to point out that linearity does not particularly restrict the expressiveness of the value function since the mapping could still be non-linear and, potentially, learned end-to-end using auxiliary tasks [Jaderberg et al.2016, Chung et al.2018].

5 Linear and Non-Linear Expectation Models

We now consider the parameterization of the expectation model: should the model be a linear projection from state-features to the next state-features or should it be an arbitrary non-linear function? In this section, we discuss the two common choices and their implications in detail.

We assume a mapping $x : \mathcal{S} \to \mathbb{R}^n$ for state-features and a value function which is linear in the state-features. An approximate expectation model consists of a dynamics function $\hat{x}$ and a reward function $\hat{r}$, constructed such that $\hat{x}(x(s), a)$ and $\hat{r}(x(s), a)$ can be used as estimates of the expected feature vector and reward that follow from state $s$ when action $a$ is taken. The general case, in which $\hat{x}$ and $\hat{r}$ are arbitrary non-linear functions, is what we call the non-linear expectation model. A special case is the linear expectation model, in which both of these functions are linear, i.e., $\hat{x}(x, a) \doteq F_a x$ and $\hat{r}(x, a) \doteq b_a^\top x$, where $F_a \in \mathbb{R}^{n \times n}$ is the transition matrix and $b_a \in \mathbb{R}^n$ is the expected-reward vector.

We now define the best linear expectation model and the best non-linear expectation model trained using data generated by policy $b$. In particular, the best linear expectation model $(F_a, b_a)$ is the one minimizing the mean squared prediction error of the next feature vector and reward under $b$'s stationary distribution.

For the uniqueness of the best linear model, we assume

Assumption 5.1

$\mathbb{E}_{S \sim d_b}\big[x(S)\, x(S)^\top\big]$ is non-singular.

Under this assumption, we have a closed-form solution for the best linear model.

The best non-linear expectation model is given by the true conditional expectations: $\hat{x}^*(x, a) \doteq \mathbb{E}_b\big[x(S_{t+1}) \mid x(S_t) = x, A_t = a\big]$ and $\hat{r}^*(x, a) \doteq \mathbb{E}_b\big[R_{t+1} \mid x(S_t) = x, A_t = a\big]$.

Both linear and non-linear models can be learned using samples via stochastic gradient descent.
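A minimal sketch of such sample-based learning for the linear case; the non-linear case would swap the linear forms for any differentiable function approximator. The data, step size, and epoch count below are illustrative:

```python
import numpy as np

def fit_linear_model(transitions, n, n_actions, lr=0.1, epochs=300, seed=0):
    """Fit F_a and b_a so that F_a @ x predicts the next feature vector and
    b_a @ x predicts the reward, by SGD on the squared prediction error."""
    rng = np.random.default_rng(seed)
    F = np.zeros((n_actions, n, n))
    b = np.zeros((n_actions, n))
    for _ in range(epochs):
        for i in rng.permutation(len(transitions)):
            x, a, r, x_next = transitions[i]
            err_x = F[a] @ x - x_next          # next-feature prediction error
            err_r = b[a] @ x - r               # reward prediction error
            F[a] -= lr * np.outer(err_x, x)    # gradient of 0.5 * ||err_x||^2
            b[a] -= lr * err_r * x             # gradient of 0.5 * err_r^2
    return F, b

# illustrative deterministic transition: feature e0, action 0 -> feature e1, reward 1
x0, x1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
F, b = fit_linear_model([(x0, 0, 1.0, x1)], n=2, n_actions=1)
```

With stochastic transitions, the same updates make `F[a] @ x` and `b[a] @ x` track the expected next feature vector and reward, which is exactly what an expectation model should output.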

5.1 Why Are Linear Models Not Enough?

In previous work, linear expectation models have been used to simulate a transition and execute the TD(0) update [Sutton et al.2012]. Convergence to the TD fixed point using TD(0) updates with a non-action linear expectation model is shown in Theorems 3.1 and 3.3 of [Sutton et al.2012]. An additional benefit of this method is that the point of convergence does not rely on the distribution of the search-control process. Critically, a non-action model cannot be used for evaluating an arbitrary policy, as it is tied to a single policy: the one that generates the data for learning the model. To evaluate multiple policies, an action model is required. In this case, the point of convergence of the algorithm is dependent on $\mu$. From Corollary 5.1 of [Sutton et al.2012], the convergence point of the TD(0) update with an action model is

$$w = \mathbb{E}_{X \sim \mu,\, A \sim \pi}\big[X\, (X - \gamma F_A X)^\top\big]^{-1}\, \mathbb{E}_{X \sim \mu,\, A \sim \pi}\big[(b_A^\top X)\, X\big].$$

It is obvious that this convergence point changes as the feature-vector-generating distribution $\mu$ changes. We now ask: even if $\mu$ equals $\mu_b$, do the TD(0)-based planning updates converge to the TD fixed point? In the next proposition we show that this is not true in general for the best linear model; however, it is true for the best non-linear model.

Let the TD fixed point with the real environment be $w^*$, with the best linear model be $w_L$, and with the best non-linear model be $w_N$ (assuming they exist). We can write their expressions as follows: $w^* \doteq \mathbb{E}\big[X (X - \gamma X')^\top\big]^{-1} \mathbb{E}[R\, X]$, where $X \sim \mu$, $A \sim \pi(\cdot \mid X)$, and $X'$ and $R$ follow the true dynamics; $w_L$ and $w_N$ are defined analogously, with the true next feature vector and reward replaced by the predictions of the best linear and best non-linear models, respectively.

Proposition 5.1

Suppose that

  1. Assumptions 2.1 and 5.1 hold, and

  2. $\mu = \mu_b$.

Then $w_N = w^*$, while in general $w_L \neq w^*$.


Figure 1: A simple two-state MDP in which two actions, red and blue, are available in each state. Each action causes the agent to move probabilistically to the other state or stay in the same state; the specific transition probabilities are given next to the arrows representing the transitions. The reward is zero everywhere except when the red action is taken in the second state. The data-generating policy $b$, the policy to be evaluated $\pi$, and the one-component feature vector of each state are specified in the figure.

5.2 An Illustrative Example on the Limitation of Linear Models

In order to clearly elucidate the limitation of linear models for planning, we use a simple two-state MDP, as outlined in Figure 1. The policy used to generate the data for learning the model and the policy to be evaluated are also described in Figure 1. We learn a linear model from the data collected by interacting with the real system using policy $b$, and verify that it is the best linear model that could be obtained. We can then obtain $w_L$ using equation (3). The solution $w^*$ of the real system is then calculated by off-policy LSTD [Yu2010] using the same data that is used to learn the linear model. In agreement with Proposition 5.1, the two resulting fixed points are considerably different.

Previous work [Parr et al.2008, Sutton et al.2012] showed that a non-action linear expectation model can suffice if the value function is linear in the features. Proposition 5.1, coupled with the above example, suggests that this is not true for the more general case of linear action models, and that expressive non-linear models could potentially be a better choice for planning with expectation models. From now on, we focus on non-linear models as the parametrization of choice for planning with expectation models.

6 Gradient-based Dyna-style Planning (GDP) Methods

In the previous section, we established that more expressive non-linear models are needed to recover the solution obtained with the real system. An equally crucial choice is that of the planning algorithm: do TD(0) planning updates converge to the fixed point? We note that for this to be true in the case of linear models, the numerical radius of the model's transition matrix must be less than 1 [Sutton et al.2012]. We conjecture that this condition might not hold in practice, causing planning to diverge. We illustrate this point using Baird's counterexample [Baird1995] in the next section.

We also see from Proposition 5.1 that the expected TD(0) planning update with the best non-linear model is the same as the expected model-free TD(0) update. We know that for off-policy learning, TD(0) is not guaranteed to be stable. This suggests that even with the best non-linear model, a TD(0)-based planning algorithm is also prone to divergence.

Inspired by the Gradient-TD off-policy policy evaluation algorithms [Sutton et al.2009] which are guaranteed to be stable under function approximation, we propose a family of convergent planning algorithms. The proposed methods are guaranteed to converge for both linear and non-linear expectation models. This is true even if the models are imperfect, which is usually the case in model-based reinforcement learning where the models are learned online.

We consider an objective function similar to the Mean Square Projected Bellman Error (MSPBE), which we call the Model-Based Mean Square Projected Bellman Error (MB-MSPBE). Let $A \doteq \mathbb{E}_{X \sim \mu,\, a \sim \pi(\cdot \mid X)}\big[X\, (X - \gamma\, \hat{x}(X, a))^\top\big]$, $b \doteq \mathbb{E}_{X \sim \mu,\, a \sim \pi(\cdot \mid X)}\big[\hat{r}(X, a)\, X\big]$, and $C \doteq \mathbb{E}_{X \sim \mu}\big[X X^\top\big]$; then

$$\text{MB-MSPBE}(w) \doteq (b - A w)^\top\, C^{-1}\, (b - A w).$$

This objective can be minimized using a variety of gradient-based methods, as we will elaborate later. We call the family of methods optimizing this objective Gradient-based Dyna-style Planning (GDP) methods.

One observation is that if the MB-MSPBE is not strictly convex, then minimizing it will give us infinitely many solutions for $w$. Note that since the features are assumed to be linearly independent, this would mean that we have infinitely many different solutions for the approximate value function, some of which might even have unbounded components. Note that this is also true for the MSPBE objective. Similar to the GTD learning methods for the MSPBE, we assume that the solution minimizing the MB-MSPBE is unique, and denote it by $w^\dagger$. This is true iff the Hessian $2 A^\top C^{-1} A$, where $A \doteq \mathbb{E}\big[X (X - \gamma \hat{x}(X, a))^\top\big]$ and $C \doteq \mathbb{E}\big[X X^\top\big]$, is invertible. This is equivalent to $A$ being non-singular.

Assumption 6.1

$A \doteq \mathbb{E}_{X \sim \mu,\, a \sim \pi(\cdot \mid X)}\big[X\, (X - \gamma\, \hat{x}(X, a))^\top\big]$ is non-singular.

It can be shown that, if the above assumption holds, the solution minimizing this objective is $w^\dagger = A^{-1} b$, where $b \doteq \mathbb{E}_{X \sim \mu,\, a \sim \pi(\cdot \mid X)}\big[\hat{r}(X, a)\, X\big]$. Note that this solution is the same as that of TD(0)-based planning when the latter converges, but since GDP optimizes this objective by gradient descent, the numerical-radius condition is not required anymore.
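As a sanity check, the closed form can be verified numerically. Everything below is synthetic: random features stand in for $X \sim \mu$, and random linear maps stand in for the model outputs $\hat{x}$ and $\hat{r}$, so this is a sketch of the algebra rather than of any experiment in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 3, 0.9
X = rng.normal(size=(500, n))             # feature vectors sampled from mu
F = 0.1 * rng.normal(size=(n, n))         # synthetic model: x_hat = F x
b_r = rng.normal(size=n)                  # synthetic model: r_hat = b_r . x
X_hat = X @ F.T
R_hat = X @ b_r

A = np.einsum('ti,tj->ij', X, X - gamma * X_hat) / len(X)  # E[X (X - g x_hat)^T]
b = (R_hat[:, None] * X).mean(axis=0)                      # E[r_hat X]
C = np.einsum('ti,tj->ij', X, X) / len(X)                  # E[X X^T]

w = np.linalg.solve(A, b)                        # closed-form minimizer A^{-1} b
grad = -2 * A.T @ np.linalg.solve(C, b - A @ w)  # gradient of MB-MSPBE at w
```

At the closed-form solution the residual `b - A @ w` vanishes, so the gradient of the objective is zero, as expected.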

We note that there is an equivalence between the MB-MSPBE and the MSPBE: if the best non-linear model is learned from the data generated by some policy $b$, and the distribution $\mu$ used in the search-control process equals $b$'s stationary feature-vector distribution $\mu_b$, then the MB-MSPBE is the MSPBE.

Figure 2: GDP vs TD(0)-based planning in Baird's counterexample: GDP remains stable, whereas the TD(0)-based planning algorithm diverges. The reported curve is the average of 10 runs; the standard deviation is too small to be clearly visible.

Proposition 6.1

Suppose that

  1. Assumptions 2.1 and 2.3 hold, and

  2. the model is the best non-linear expectation model and $\mu = \mu_b$.

Then $w^\dagger = w^*$.

Note that Proposition 6.1 does not hold for the best linear model, for the same reason elaborated in Proposition 5.1.

Let us now consider algorithms that can be used to minimize this objective. The gradient of the MB-MSPBE is

$$\nabla \text{MB-MSPBE}(w) = -2\, A^\top\, C^{-1}\, (b - A w).$$

Note that in the above expression we have a product of three expectations, and therefore we cannot simply use one sample to obtain an unbiased estimate of the gradient. To obtain such an estimate, we could either draw three independent samples or learn the product of the last two factors using a linear least-squares method. GTD methods take the second route, leading to an $O(n)$ algorithm in which two sets of parameters are mutually related in their updates. However, if one uses a linear model, the computational complexity for storing and using the model is already $O(n^2)$. For a non-linear model, it can be either smaller or greater than $O(n^2)$, depending on the parameterization choices. Thus, a planning algorithm with $O(n^2)$ complexity can be an acceptable choice. This leads to two choices. First, we can sample the three expectations independently and combine them to produce an unbiased estimate of the gradient; note that this would still lead to an $O(n^2)$ algorithm, as the matrix inverse can be maintained in $O(n^2)$ using the Sherman-Morrison formula. The second choice is to use the linear least-squares method to estimate the first two expectations and sample the last one. In this case, there are still two sets of parameters, but their updates are not mutually dependent, which can potentially lead to faster convergence. Although both of these approaches have $O(n^2)$ complexity, we adopt the second approach, which is summarized in Algorithm 1. We now present the convergence theorem for the proposed algorithm, followed by its empirical evaluation. The reader can refer to the supplementary material for the proof of the theorem.
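The $O(n^2)$ maintenance of the inverse mentioned above can be sketched with the Sherman-Morrison formula; the regularization constant and the random data below are illustrative:

```python
import numpy as np

def sm_update(C_inv, x):
    """O(n^2) update: the inverse of (C + x x^T) from the inverse of C,
    via the Sherman-Morrison formula."""
    Cx = C_inv @ x
    return C_inv - np.outer(Cx, Cx) / (1.0 + x @ Cx)

# maintain the inverse of eps*I + sum_t x_t x_t^T incrementally
rng = np.random.default_rng(0)
n, eps = 4, 1e-2
C = eps * np.eye(n)          # the matrix itself, kept only for comparison
C_inv = np.eye(n) / eps      # its inverse, updated in O(n^2) per sample
for _ in range(50):
    x = rng.normal(size=n)
    C += np.outer(x, x)
    C_inv = sm_update(C_inv, x)
```

Each rank-one update costs $O(n^2)$ instead of the $O(n^3)$ of a fresh inversion, which matches the complexity budget argued for above.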

Input: initial weight vector $w_0$, policy $\pi$, feature-vector distribution $\mu$, expectation model $(\hat{x}, \hat{r})$, stepsizes $\alpha_t$ for $t = 0, 1, 2, \ldots$

1:  for $t = 0, 1, 2, \ldots$ do
2:     Sample $X_t \sim \mu$
3:     Sample $A_t \sim \pi(\cdot \mid X_t)$
4:     Update the least-squares estimates of the two expectations using $(X_t, \hat{x}(X_t, A_t), \hat{r}(X_t, A_t))$, and take a stochastic gradient step on $w_t$
5:  end for
Algorithm 1 GDP Algorithm
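To make the structure of such a planner concrete, here is a hedged sketch (not a faithful transcription of Algorithm 1): running averages estimate the two matrix expectations, the remaining factor of the gradient is sampled, and the weights descend on the MB-MSPBE. The synthetic linear model and all constants are illustrative:

```python
import numpy as np

def gdp_sketch(sample_x, model, n, gamma=0.9, steps=8000, alpha=0.05, seed=0):
    """GDP-style planning sketch: estimate A = E[X (X - gamma x_hat)^T] and
    C = E[X X^T] by running averages, sample the remaining factor of the
    MB-MSPBE gradient, and take gradient steps on w."""
    rng = np.random.default_rng(seed)
    A_hat = np.zeros((n, n))
    C_hat = np.eye(n)                  # identity init keeps C_hat invertible
    w = np.zeros(n)
    for t in range(1, steps + 1):
        x = sample_x(rng)              # search control: X ~ mu
        x_next, r = model(x)           # expectation model outputs x_hat, r_hat
        A_hat += (np.outer(x, x - gamma * x_next) - A_hat) / (t + 1)
        C_hat += (np.outer(x, x) - C_hat) / (t + 1)
        delta = r + gamma * w @ x_next - w @ x      # sampled Bellman-error term
        w += alpha * A_hat.T @ np.linalg.solve(C_hat, delta * x)
    return w
```

With a synthetic linear model $\hat{x}(x) = Fx$, $\hat{r}(x) = b_r^\top x$ and standard-normal features, the iterate approaches the closed-form minimizer $(I - \gamma F^\top)^{-1} b_r$; no numerical-radius condition on $F$ is needed for stability, which is the point of the gradient-based formulation.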
Theorem 6.1 (Convergence of GDP Algorithm)

Consider Algorithm 1. If

  1. Assumptions 2.2, 2.3, and 6.1 hold, and

  2. the stepsizes satisfy $\alpha_t > 0$, $\sum_{t=0}^{\infty} \alpha_t = \infty$, and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$,

then for any initial weight vector $w_0$, $w_t \to w^\dagger$ w.p. 1.

Figure 3: Planning with GDP: The algorithm remains stable and converges to the off-policy LSTD solution in both the (left) Four Room and (right) stochastic Mountain Car domains. The reported curve is the average of 10 runs; the standard deviation is too small to be clearly visible.

7 Experiments

The goal of this section is to validate the theoretical results and investigate how the GDP algorithm performs in practice. Concretely, we seek to answer the following questions: 1) is the proposed planning algorithm stable for the non-linear model choice, especially when the model is learned online, and 2) what solution does the proposed planning algorithm converge to?

7.1 Divergence in TD(0)-based Planning

Our first experiment is designed to illustrate the divergence issue with the TD(0)-based planning update. We use Baird's counterexample [Baird1995, Sutton and Barto2018] with the same dynamics and reward function: a classic example highlighting the off-policy divergence problem of the model-free TD(0) algorithm. The policy used to learn the model is arranged to be the same as the behavior policy in the counterexample, whereas the policy to be evaluated is arranged to be the same as the counterexample's target policy. For TD(0) with the linear model, we initialize the matrix $F_a$ and vector $b_a$ to zero for all actions $a$. For GDP, we use a neural network with one hidden layer of 200 units as the non-linear model, initialized with Xavier initialization [Glorot and Bengio2010]. The parameter $w$ of the estimated value function is initialized as proposed in the counterexample. The model is learned in an online fashion, that is, we use only the most recent sample to perform a gradient-descent update on the mean squared error. The search-control process is also restricted to generate the last-seen feature vector, which is then used with an action sampled from the policy to be evaluated to simulate the next feature vector. The resulting simulated transition is used to apply the planning update. The evaluation metric is the Root Mean Square Value Error (RMSVE): $\sqrt{\sum_s d(s)\, \big(\hat{v}(s, w) - v_\pi(s)\big)^2}$. The results are reported for hyperparameters chosen based on the RMSVE over the latter half of a run. In Figure 2, we see that TD(0) updates with the linear expectation model cause the value function to diverge. In contrast, the GDP-based planning algorithm remains sound and converges to an RMSVE of 2.0. Interestingly, stable model-free methods also converge to the same RMSVE value (not shown here) [Sutton and Barto2018].

7.2 Convergence in Practice

In this set of experiments, we investigate how the GDP algorithm performs in practice. We evaluate the proposed method with the non-linear model choice in two simple yet illustrative domains: Four Room [Sutton et al.1999, Ghiassian et al.2018] and a stochastic variant of Mountain Car [Sutton1996]. As in the previous experiment, the model is learned online. Search control, however, samples uniformly from the 1000 most recently visited feature vectors to approximate the i.i.d. assumption in Theorem 6.1.

We modified the Four Room domain by changing the states in the four corners to terminal states. The reward is zero everywhere except when the agent transitions into a terminal state, where the reward is one. The episode starts in one of the non-terminal states uniformly at random. The policy used to generate the data for learning the model takes all actions with equal probability, whereas the policy to be evaluated constitutes the shortest path to the top-left terminal state and is deterministic. We used tile coding [Sutton1996] to obtain the features. In mountain car, the policy used to generate the data is the standard energy-pumping policy with 50% randomness [Le et al.2017], whereas the policy to be evaluated is also the standard energy-pumping policy but with no randomness. We again used tile coding to obtain the features. We inject stochasticity into the environment by executing the chosen action only 70% of the time, whereas a random action is executed 30% of the time. In both experiments, we do one planning step for each sample collected by the policy $b$. As noted in Proposition 6.1, if $\mu = \mu_b$, the minimizer of the MB-MSPBE for the best non-linear model is the off-policy LSTD solution [Yu2010]. Therefore, for both domains, we run the off-policy LSTD algorithm for 2 million time steps and use the distance to the resulting solution as the evaluation metric.

The results are reported for hyperparameters chosen according to the LSTD-solution-based loss over the latter half of a run. In Figure 3, we see that GDP remains stable and converges to the off-policy LSTD solution in both domains.

8 Conclusion

In this paper, we proposed a sound way of using expectation models for planning and showed that it is equivalent to planning with distribution models if the state value function is linear in the state-features. We made a theoretical argument for non-linear expectation models as the parametrization of choice even when the value function is linear. Lastly, we proposed GDP, a model-based policy evaluation algorithm with convergence guarantees, and empirically demonstrated its effectiveness.

9 Acknowledgement

We would like to thank Huizhen Yu, Sina Ghiassian and Banafsheh Rafiee for useful discussions and feedback.