1 Introduction
Learning models of the world and effectively planning with them remains a longstanding challenge in artificial intelligence. Model-based reinforcement learning formalizes this problem in the context of reinforcement learning, where the model refers to the environment’s transition dynamics and reward function. Once the model of the environment has been learned, an agent can potentially use it to arrive at plans without needing to interact with the environment.
The output of the model is one of the key choices in the design of a planning agent, as it determines the way the model is used for planning. Should the model produce 1) a distribution over the next state feature vector, 2) a sample of the next state feature vector, or 3) the expected next state feature vector? For stochastic environments, distribution and sample models can be used effectively, particularly if the distribution can be assumed to be of a special form [Deisenroth and Rasmussen2011, Chua et al.2018]. For arbitrarily stochastic environments, learning a sample or distribution model could be intractable or even impossible. For deterministic environments, expectation models appear to be the default choice as they are easier to learn and have been used [Oh et al.2015, Leibfried et al.2016]. However, for general stochastic environments, it is not obvious how expectation models can be used for planning as they only partially characterize a distribution. In this paper, we develop an approach to use expectation models for arbitrarily stochastic environments by restricting the value function to be linear in the state-feature vector.
Once the choice of expectation models with linear value function has been made, the next question is to develop an algorithm which uses the model for planning. In previous work, planning methods have been proposed which use expectation models for policy evaluation [Sutton et al.2012]. However, as we demonstrate empirically, the proposed methods require strong conditions on the model which might not hold in practice, causing the value function to diverge to infinity. Thus, a key challenge is to devise a sound planning algorithm which uses approximate expectation models for policy evaluation and has convergence guarantees. In this work, we propose a new objective function called Model-Based Mean Square Projected Bellman Error (MBMSPBE) for policy evaluation and show how it relates to Mean Square Projected Bellman Error (MSPBE) [Sutton et al.2009]. We derive a planning algorithm which minimizes the proposed objective and show its convergence under conditions which are milder than the ones assumed in the previous work [Sutton et al.2012].
It is important to note that in this work, we focus on the value prediction task for model-based reinforcement learning. Predicting the value of a policy is an integral component of Generalized Policy Iteration (GPI), on which many modern reinforcement learning control algorithms are built [Sutton and Barto2018]. Policy evaluation is also key for building predictive world knowledge where the questions about the world are formulated using value functions [Sutton et al.2011, Modayil et al.2014, White and others2015]. More recently, policy evaluation has also been shown to be useful for representation learning where value functions are used as auxiliary tasks [Jaderberg et al.2016]. While model-based reinforcement learning for policy evaluation is interesting in its own right, the ideas developed in this paper can also be extended to the control setting.
2 Problem Setting
We formalize an agent’s interaction with its environment by a finite Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, p, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $\mathcal{R}$ is a set of rewards, $p$ is the model such that $p(s', r \mid s, a) \ge 0$ and $\sum_{s', r} p(s', r \mid s, a) = 1$, and $\gamma$ is the discount factor. A stationary policy $\pi$ determines the behavior of the agent. The value function $v_\pi$ describes the expected discounted sum of rewards obtained by following policy $\pi$. In this work, we assume the discounted bounded reward problem setting, i.e., $\gamma \in [0, 1)$ and $\max_{r \in \mathcal{R}} |r| < \infty$.
In practice, the agent does not have access to the states directly, but only through a $d$-dimensional real-valued feature vector $\phi = \mathbf{x}(s)$, where $\mathbf{x}: \mathcal{S} \to \mathbb{R}^d$ is the feature mapping, which can be an arbitrarily complex function for extracting the state-features. Tile coding [Sutton1996] and the Fourier basis [Konidaris et al.2011] are examples of state-feature mapping functions which are expert designed. An alternative is to learn the mapping using auxiliary tasks and approximate the value function using the learned state-features [Chung et al.2018, Jaderberg et al.2016]. In that case the value function is usually approximated using a parametrized function with an $n$-dimensional weight vector $\mathbf{w}$, where typically $n \ll |\mathcal{S}|$. We write $\hat v(\phi, \mathbf{w})$ for the approximate value of state $s$. The approximate value function can either be a linear function of the state-features, where $\hat v(\phi, \mathbf{w}) = \mathbf{w}^\top \phi$, or a nonlinear function $\hat v(\phi, \mathbf{w}) = f(\phi, \mathbf{w})$, where $f$ is an arbitrary function. Similarly, it’s common to use the state-feature vector as the input of the policy and as both input and output of the approximate model, which we will discuss in the next section.
The Dyna architecture [Sutton1991] is a model-based reinforcement learning (MBRL) algorithm which unifies learning, planning, and acting via updates to the value function. The agent interacts with the world, using observed state, action, next state, and reward tuples to estimate the model $p$ and to update an estimate of the action-value function $q_\pi$ for policy $\pi$. The planning step in Dyna repeatedly samples possible next states and rewards from the model, given input state-action pairs. These hypothetical experiences can be used to update the action-value function, just as if they had been generated by interacting with the environment. The search-control procedure decides which states and actions are used to query the model during planning. The efficiency of planning can be significantly improved with non-uniform search control such as prioritized sweeping [Moore and Atkeson1993, Sutton et al.2012, Pan et al.2018]. In the function approximation setting, there are three factors that can affect the solution of a planning algorithm: 1) the distribution of data used to train the model, 2) the search-control process’s distribution for selecting the starting feature vectors and actions for simulating the next feature vectors, and 3) the policy being evaluated.
Consider an agent wanting to evaluate a policy $\pi$, i.e., approximate $v_\pi$, using a Dyna-style planning algorithm. Assume that the data used to learn the model come from the agent’s interaction with the environment using policy $b$. It is common to have an ergodicity assumption on the Markov chain induced by $b$, i.e.,
Assumption 2.1
The Markov chain induced by policy $b$ is ergodic.
Under this assumption, expectations can be defined in terms of the unique stationary distribution; for example, $\mathbb{E}_b[\cdot]$ denotes an expectation taken under the stationary distribution induced by $b$.
Let $d_b$ denote $b$’s stationary state distribution and let $\mathcal{S}_\phi$ be the set of all states sharing feature vector $\phi$. Consequently, the stationary feature vector distribution corresponding to $d_b$ would be $\mu(\phi) = \sum_{s \in \mathcal{S}_\phi} d_b(s)$. Suppose the search-control process generates a sequence of i.i.d. random vectors $\{\phi_k\}$, where each $\phi_k$ follows distribution $\zeta$, and chooses actions $\{A_k\}$ according to policy $\pi$, i.e., $A_k \sim \pi(\cdot \mid \phi_k)$, which is the policy to be evaluated. The sequence $\{\phi_k\}$ is usually assumed to be bounded.
Assumption 2.2
$\{\phi_k\}$ is bounded.
Since we assumed the finite MDP setting, the numbers of states, actions, and feature vectors are all finite and the model output is, therefore, always bounded. For the uniqueness of the solution, it is also assumed that the feature vectors generated by the search-control process are linearly independent:
Assumption 2.3
$\mathbb{E}_\zeta[\phi_k \phi_k^\top]$ is nonsingular.
3 Approximate Models
In the function approximation setting, an approximate model of the transition dynamics and the reward function is used for planning. The approximate model may map a state-feature vector and an action to either a distribution over the next state-feature vectors and rewards (distribution model), a sample from the distribution over the next-state feature vectors and rewards (sample model), or the expected next state-feature vector and reward (expectation model). For the sake of brevity, we refer to the approximate models as just models from here onwards.
A distribution model takes a state-feature vector and action as input and produces a distribution over the next-state feature vectors and rewards. Distribution models are deterministic, as the input of the model completely determines the output. Distribution models have typically been used with special forms such as Gaussians [Chua et al.2018] or Gaussian processes [Deisenroth and Rasmussen2011]. In general, however, learning a distribution can be impractical as distributions are potentially large objects. For example, if the state is represented by a feature vector of dimension $d$, then the first moment of its distribution is a $d$-vector, but the second moment is a $d \times d$ matrix, the third moment is $d \times d \times d$, and so on.
Sample models are a more practical alternative, as they only need to generate a sample of the next-state feature vector and reward, given a state-feature vector and action. Sample models can use arbitrary distributions to generate the samples, even though they do not explicitly represent those distributions, but can still produce objects of a limited size (e.g., feature vectors of dimension $d$). They are particularly well suited for sample-based planning methods such as Monte Carlo Tree Search [Coulom2006]. Unlike distribution models, however, sample models are stochastic, which creates an additional branching factor in planning, as multiple samples need to be drawn to gain a representative sense of what might happen.
Expectation models are an even simpler approach, where the model produces the expectation of the next-state feature vector and reward. The advantages of expectation models are that the state output is compact (like a sample model) and deterministic (like a distribution model). The potential disadvantage of an expectation model is that it is only a partial characterization of the distribution. For example, if the result of an action (or option) is that two binary state features both occur with probability 0.5, but are never present (=1) together, then an expectation model can capture the first part (each present with probability 0.5), but not the second (never both present). This may not be a substantive limitation, however, as we can always add a third binary state feature, for example, for the AND of the original two features, and then capture the full distribution with the expectation of all three state features.
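The two-binary-feature example above can be checked numerically. The following is a minimal sketch, not taken from the paper: the names `outcomes`, `probs`, `add_and_feature`, and `joint_from_expectation` are hypothetical, and the joint distribution is the one described in the text (each feature present with probability 0.5, never both).

```python
import numpy as np

# Joint distribution over two binary next-state features (f1, f2):
# each is present with probability 0.5, but they are never present together.
outcomes = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = np.array([0.5, 0.5])

def add_and_feature(phi):
    """Append a third feature equal to the AND of the two original features."""
    return np.append(phi, phi[0] * phi[1])

# Expectation of the two original features: marginals only.
expected_two = probs @ outcomes

# Expectation of the augmented three-feature vector.
expected_three = probs @ np.array([add_and_feature(o) for o in outcomes])

def joint_from_expectation(e):
    """Recover P(f1, f2) from E[f1], E[f2], E[f1 AND f2] for binary features."""
    p11 = e[2]
    p10 = e[0] - p11
    p01 = e[1] - p11
    p00 = 1.0 - p10 - p01 - p11
    return {"00": p00, "10": p10, "01": p01, "11": p11}

joint = joint_from_expectation(expected_three)
print(expected_two, expected_three, joint)
```

With only two features the expectation is compatible with many joint distributions (including independent features); with the added AND feature the joint probabilities are determined by the three expected values.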
4 Expectation Models and Linearity
Expectation models can be less complex than distribution and sample models and, therefore, can be easier to learn. This is especially critical for model-based reinforcement learning where the agent is to learn a model of the world and use it for planning. In this work, we focus on answering the question: how can expectation models be used for planning in Dyna, despite the fact that they are only a partial characterization of the transition dynamics?
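There is a surprisingly simple answer to this question: if the value function is linear in the state features, then there is no loss of generality when using an expectation model for planning. Consider the dynamic programming update for policy evaluation with a distribution model $\hat p(\phi', r \mid \phi, a)$, for a given feature vector $\phi$ and action $a$:
$$\mathbb{E}_{\hat p}\big[R + \gamma\, \hat v(\phi', \mathbf{w}) \mid \phi, a\big] = \sum_{\phi', r} \hat p(\phi', r \mid \phi, a)\,\big(r + \gamma\, \mathbf{w}^\top \phi'\big) \tag{1}$$
$$= \hat r(\phi, a) + \gamma\, \mathbf{w}^\top \hat{\mathbf{x}}(\phi, a) \tag{2}$$
where $\hat r(\phi, a) = \sum_{\phi', r} \hat p(\phi', r \mid \phi, a)\, r$ and $\hat{\mathbf{x}}(\phi, a) = \sum_{\phi', r} \hat p(\phi', r \mid \phi, a)\, \phi'$ are the expected reward and the expected next feature vector.
The second equation uses an expectation model corresponding to the distribution model. This result shows that no generality has been lost by using an expectation model, if the value function is linear. Further, the same equations also advocate the other direction: if we are using an expectation model, then the approximate value function should be linear. This is because (2) is unlikely to equal (1) for general distributions if $\hat v$ is not linear in state-features.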
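The argument above can be illustrated with a small numerical check. This is a sketch under assumptions, not the paper's code: the distribution model is a randomly generated finite distribution for a single feature-vector/action pair, and `linear_value`, `quadratic_value`, `distribution_backup`, and `expectation_backup` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_outcomes, gamma = 4, 6, 0.9

# Hypothetical distribution model for one (feature vector, action) pair:
# a finite set of possible next feature vectors and rewards with probabilities.
next_phis = rng.normal(size=(n_outcomes, d))
rewards = rng.normal(size=n_outcomes)
probs = rng.dirichlet(np.ones(n_outcomes))

# Corresponding expectation model outputs.
x_hat = probs @ next_phis   # expected next feature vector
r_hat = probs @ rewards     # expected reward

w = rng.normal(size=d)

def linear_value(phi):
    return w @ phi

def quadratic_value(phi):
    # An arbitrary nonlinear value function, used only for contrast.
    return (w @ phi) ** 2

def distribution_backup(value_fn):
    """Backup target computed from the full distribution, as in (1)."""
    return sum(p * (r + gamma * value_fn(x))
               for p, r, x in zip(probs, rewards, next_phis))

def expectation_backup(value_fn):
    """Backup target computed from the expectation model only, as in (2)."""
    return r_hat + gamma * value_fn(x_hat)

linear_gap = abs(distribution_backup(linear_value) - expectation_backup(linear_value))
nonlinear_gap = abs(distribution_backup(quadratic_value) - expectation_backup(quadratic_value))
print(linear_gap, nonlinear_gap)
```

For the linear value function the two targets agree up to floating-point error; for the quadratic one the gap equals `gamma` times the variance of `w @ phi'` under the distribution, which is nonzero whenever the next feature vector is genuinely random.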
It is important to point out that linearity does not particularly restrict the expressiveness of the value function, since the feature mapping could still be nonlinear and, potentially, learned end-to-end using auxiliary tasks [Jaderberg et al.2016, Chung et al.2018].
5 Linear and Non-Linear Expectation Models
We now consider the parameterization of the expectation model: should the model be a linear projection from state-features to the next state-features, or should it be an arbitrary nonlinear function? In this section, we discuss the two common choices and their implications in detail.
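We assume a mapping $\mathbf{x}$ for state-features and a value function $\hat v$ which is linear in state-features. An approximate expectation model consists of a dynamics function $\hat{\mathbf{x}}: \mathbb{R}^d \times \mathcal{A} \to \mathbb{R}^d$ and a reward function $\hat r: \mathbb{R}^d \times \mathcal{A} \to \mathbb{R}$, constructed such that $\hat{\mathbf{x}}(\phi, a)$ and $\hat r(\phi, a)$ can be used as estimates of the expected feature vector and reward that follow from $\phi$ when an action $a$ is taken. The general case, in which $\hat{\mathbf{x}}$ and $\hat r$ are arbitrary nonlinear functions, is what we call the nonlinear expectation model. A special case is the linear expectation model, in which both of these functions are linear, i.e., $\hat{\mathbf{x}}(\phi, a) = \mathbf{F}_a \phi$ and $\hat r(\phi, a) = \mathbf{b}_a^\top \phi$, where $\mathbf{F}_a$ is the transition matrix and $\mathbf{b}_a$ is the expected reward vector.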
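We now define the best linear expectation model and the best nonlinear expectation model trained using data generated by policy $b$, writing $\mathbb{E}_b$ for an expectation under the stationary distribution induced by $b$. In particular, let $\{\mathbf{F}_a^*, \mathbf{b}_a^*\}$ be the best linear expectation model, where for every action $a$
$$\mathbf{F}_a^* = \arg\min_{\mathbf{G}} \mathbb{E}_b\big[\mathbb{I}(A_t = a)\, \|\mathbf{G}\phi_t - \phi_{t+1}\|_2^2\big], \qquad \mathbf{b}_a^* = \arg\min_{\mathbf{u}} \mathbb{E}_b\big[\mathbb{I}(A_t = a)\, (\mathbf{u}^\top \phi_t - R_{t+1})^2\big].$$
For the uniqueness of the best linear model, we assume
Assumption 5.1
$\mathbb{E}_b[\mathbb{I}(A_t = a)\, \phi_t \phi_t^\top]$ is nonsingular for every action $a$.
Under this assumption we have a closed-form solution for the linear model.
The best nonlinear expectation model is $\hat{\mathbf{x}}^*(\phi, a) = \mathbb{E}_b[\phi_{t+1} \mid \phi_t = \phi, A_t = a]$ and $\hat r^*(\phi, a) = \mathbb{E}_b[R_{t+1} \mid \phi_t = \phi, A_t = a]$.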
Both linear and nonlinear models can be learned using samples via stochastic gradient descent.
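As a complement to stochastic gradient descent, a linear expectation model can also be fit in closed form by per-action least squares. The sketch below is an illustration under assumptions rather than the authors' procedure: the function name `fit_linear_expectation_model`, the small `ridge` term (added only for numerical safety), and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_linear_expectation_model(phis, actions, rewards, next_phis, n_actions, ridge=1e-8):
    """Per-action least squares: F[a] @ phi ~ E[phi'], b[a] @ phi ~ E[reward].

    `ridge` is an assumed regularizer that keeps the normal equations solvable.
    """
    d = phis.shape[1]
    F = np.zeros((n_actions, d, d))
    b = np.zeros((n_actions, d))
    for a in range(n_actions):
        mask = actions == a
        X, Y, r = phis[mask], next_phis[mask], rewards[mask]
        C = X.T @ X + ridge * np.eye(d)          # empirical sum of phi phi^T
        F[a] = np.linalg.solve(C, X.T @ Y).T     # minimizes sum ||F phi - phi'||^2
        b[a] = np.linalg.solve(C, X.T @ r)       # minimizes sum (b^T phi - r)^2
    return F, b

# Synthetic check: data generated by a known linear model (no noise).
rng = np.random.default_rng(0)
d, n_actions, n = 3, 2, 500
true_F = rng.normal(size=(n_actions, d, d))
true_b = rng.normal(size=(n_actions, d))
phis = rng.normal(size=(n, d))
actions = rng.integers(n_actions, size=n)
next_phis = np.einsum("nij,nj->ni", true_F[actions], phis)
rewards = np.einsum("ni,ni->n", true_b[actions], phis)

F, b = fit_linear_expectation_model(phis, actions, rewards, next_phis, n_actions)
```

When the true expected dynamics are not linear in the features, the same code returns the best linear fit under the data distribution, which is exactly the situation in which a nonlinear model can do better.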
5.1 Why Linear Models are Not Enough?
In previous work, linear expectation models have been used to simulate a transition and execute a TD(0) update [Sutton et al.2012]. Convergence to the TD fixed point using TD(0) updates with a non-action linear expectation model is shown in Theorems 3.1 and 3.3 of [Sutton et al.2012]. An additional benefit of this method is that the point of convergence does not rely on the distribution of the search-control process. Critically, a non-action model cannot be used for evaluating an arbitrary policy, as it is tied to a single policy: the one that generates the data for learning the model. To evaluate multiple policies, an action model is required. In this case, the point of convergence of the algorithm depends on the search-control distribution $\zeta$. From Corollary 5.1 of [Sutton et al.2012], the convergent point of the TD(0) update with an action model is:
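$$\mathbf{w} = (\mathbf{I} - \gamma \mathbf{F}^\top)^{-1} \mathbf{b} \tag{3}$$
where $\mathbf{F} = \mathbb{E}_\zeta[\mathbf{F}_{A_k} \phi_k \phi_k^\top]\, \mathbb{E}_\zeta[\phi_k \phi_k^\top]^{-1}$ and $\mathbf{b} = \mathbb{E}_\zeta[\phi_k \phi_k^\top]^{-1}\, \mathbb{E}_\zeta[\phi_k \phi_k^\top \mathbf{b}_{A_k}]$, with $\phi_k \sim \zeta$ and $A_k \sim \pi(\cdot \mid \phi_k)$. It is clear that the convergence point changes as the feature-vector-generating distribution $\zeta$ changes. We now ask: even if $\zeta$ equals $\mu$, do the TD(0)-based planning updates converge to the TD fixed point? In the next proposition we show that this is not true in general for the best linear model; however, it is true for the best nonlinear model.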
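Proposition 5.1
Let the TD fixed point with the real environment be $\mathbf{w}_{\text{env}}$, with the best linear model be $\mathbf{w}_{\text{linear}}$, and with the best nonlinear model be $\mathbf{w}_{\text{nonlinear}}$ (assuming they exist). We can write their expressions as follows:
$$\begin{aligned}
\mathbf{w}_{\text{env}} &= \mathbb{E}_b\big[\rho_t \phi_t (\phi_t - \gamma \phi_{t+1})^\top\big]^{-1}\, \mathbb{E}_b\big[\rho_t R_{t+1} \phi_t\big], \\
\mathbf{w}_{\text{linear}} &= \mathbb{E}_\zeta\big[\phi_k (\phi_k - \gamma \mathbf{F}^*_{A_k} \phi_k)^\top\big]^{-1}\, \mathbb{E}_\zeta\big[\phi_k \phi_k^\top \mathbf{b}^*_{A_k}\big], \\
\mathbf{w}_{\text{nonlinear}} &= \mathbb{E}_\zeta\big[\phi_k (\phi_k - \gamma \hat{\mathbf{x}}^*(\phi_k, A_k))^\top\big]^{-1}\, \mathbb{E}_\zeta\big[\hat r^*(\phi_k, A_k)\, \phi_k\big],
\end{aligned} \tag{4}$$
where $\rho_t = \pi(A_t \mid \phi_t) / b(A_t \mid \phi_t)$ is the importance sampling ratio. If $\zeta = \mu$, then $\mathbf{w}_{\text{nonlinear}} = \mathbf{w}_{\text{env}}$, whereas in general $\mathbf{w}_{\text{linear}} \neq \mathbf{w}_{\text{env}}$.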
5.2 An Illustrative Example on the Limitation of Linear Models
In order to clearly elucidate the limitation of linear models for planning, we use a simple two-state MDP, as outlined in Figure 1. The policy $b$ used to generate the data for learning the model and the policy $\pi$ to be evaluated are also described in Figure 1. We learn a linear model with the data collected by interacting with the real system using policy $b$ and verify that it is the best linear model that could be obtained. We can then obtain the planning fixed point $\mathbf{w}_{\text{linear}}$ using equation (3). The solution of the real system, $\mathbf{w}_{\text{env}}$, is then calculated by off-policy LSTD [Yu2010] using the same data that is used to learn the linear model. In agreement with Proposition 5.1, the two resulting fixed points are considerably different.
Previous works [Parr et al.2008, Sutton et al.2012] showed that a non-action linear expectation model could be enough if the value function is linear in features. Proposition 5.1 coupled with the above example suggests that this is not true for the more general case of linear expectation models, and expressive nonlinear models could potentially be a better choice for planning with expectation models. From now on, we focus on nonlinear models as the parametrization of choice for planning with expectation models.
6 Gradient-based Dyna-style Planning (GDP) Methods
In the previous section, we established that more expressive nonlinear models are needed to recover the solution obtained by the real system. An equally crucial choice is that of the planning algorithm: do TD(0) planning updates converge to the fixed point? We note that for this to be true in the case of linear models, we require the numerical radius of the model’s transition matrix $\mathbf{F}$ to be less than 1 [Sutton et al.2012]. We conjecture that this condition might not hold in practice, causing the planning to diverge. We illustrate this point using Baird’s counterexample [Baird1995] in the next section.
We also see from Proposition 5.1 that the expected TD(0) planning update with the best nonlinear model is the same as the expected model-free TD(0) update $\mathbb{E}_b[\rho_t \delta_t \phi_t]$, where $\rho_t = \pi(A_t \mid \phi_t) / b(A_t \mid \phi_t)$ and $\delta_t = R_{t+1} + \gamma \mathbf{w}^\top \phi_{t+1} - \mathbf{w}^\top \phi_t$. We know that for off-policy learning, TD(0) is not guaranteed to be stable. This suggests that even with the best nonlinear model, the TD(0)-based planning algorithm is also prone to divergence.
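The equality of the two expected updates can be sketched with the tower property of conditional expectation. The derivation below is our own reconstruction under stated assumptions, not the paper's proof: search control samples $\phi \sim \mu$ and $A \sim \pi(\cdot \mid \phi)$, both policies depend on the state only through $\phi$, $b(a \mid \phi) > 0$ whenever $\pi(a \mid \phi) > 0$, and $\hat{\mathbf{x}}^*, \hat r^*$ denote conditional expectations of the next feature vector and reward under $b$.

```latex
\begin{aligned}
\mathbb{E}_{\phi \sim \mu,\, A \sim \pi}\Big[\big(\hat r^*(\phi, A) + \gamma \mathbf{w}^\top \hat{\mathbf{x}}^*(\phi, A) - \mathbf{w}^\top \phi\big)\, \phi\Big]
&= \sum_{\phi} \mu(\phi) \sum_{a} \pi(a \mid \phi)\,
   \mathbb{E}_b\big[R_{t+1} + \gamma \mathbf{w}^\top \phi_{t+1} - \mathbf{w}^\top \phi_t \,\big|\, \phi_t = \phi, A_t = a\big]\, \phi \\
&= \sum_{\phi} \mu(\phi) \sum_{a} b(a \mid \phi)\, \frac{\pi(a \mid \phi)}{b(a \mid \phi)}\,
   \mathbb{E}_b\big[\delta_t \,\big|\, \phi_t = \phi, A_t = a\big]\, \phi \\
&= \mathbb{E}_b\big[\rho_t\, \delta_t\, \phi_t\big].
\end{aligned}
```

The first step uses linearity of the value function to move the conditional expectation outside; the second rewrites the sum over actions under $\pi$ as a sum under $b$ weighted by the importance sampling ratio.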
Inspired by the Gradient-TD off-policy policy evaluation algorithms [Sutton et al.2009], which are guaranteed to be stable under function approximation, we propose a family of convergent planning algorithms. The proposed methods are guaranteed to converge for both linear and nonlinear expectation models. This is true even if the models are imperfect, which is usually the case in model-based reinforcement learning where the models are learned online.
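We consider an objective function similar to Mean Square Projected Bellman Error (MSPBE), which we call Model-Based Mean Square Projected Bellman Error (MBMSPBE). Let $\text{MBMSPBE}(\mathbf{w}) = \mathbb{E}_\zeta[\Delta_k \phi_k]^\top\, \mathbb{E}_\zeta[\phi_k \phi_k^\top]^{-1}\, \mathbb{E}_\zeta[\Delta_k \phi_k]$, where $\Delta_k = \hat r(\phi_k, A_k) + \gamma \mathbf{w}^\top \hat{\mathbf{x}}(\phi_k, A_k) - \mathbf{w}^\top \phi_k$ is the model-based TD error for a feature vector $\phi_k \sim \zeta$ and action $A_k \sim \pi(\cdot \mid \phi_k)$. This objective can be minimized using a variety of gradient-based methods, as we will elaborate later. We call the family of methods optimizing this objective Gradient-based Dyna-style Planning (GDP) methods.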
One observation is that if MBMSPBE is not strictly convex, then minimizing it will give us infinitely many solutions for $\mathbf{w}$. Note that since features are assumed to be independent, this would mean that we have infinitely many different solutions for the approximate value function, and some of them might even have unbounded components. Note that this is also true for the MSPBE objective. Similar to the GTD learning methods for MSPBE, we assume that the solution for minimizing MBMSPBE is unique, denoted by $\mathbf{w}^*$. Note that this is true iff the Hessian $2 \mathbf{A}^\top \mathbf{C}^{-1} \mathbf{A}$, where $\mathbf{A} = \mathbb{E}_\zeta[\phi_k (\phi_k - \gamma \hat{\mathbf{x}}(\phi_k, A_k))^\top]$ and $\mathbf{C} = \mathbb{E}_\zeta[\phi_k \phi_k^\top]$, is invertible. This is equivalent to $\mathbf{A}$ being nonsingular.
Assumption 6.1
$\mathbf{A}$ is nonsingular.
It can be shown that the solution for minimizing this objective is $\mathbf{w}^* = \mathbf{A}^{-1} \mathbf{c}$, where $\mathbf{c} = \mathbb{E}_\zeta[\hat r(\phi_k, A_k)\, \phi_k]$, if the above assumption holds. Note that this is also the fixed point of TD(0)-based planning with the same model; but since GDP optimizes the objective by gradient descent, the numerical radius condition is no longer required.
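To make the objective and its minimizer concrete, here is a sample-based sketch. It assumes the objective has the quadratic form $\mathbb{E}[\Delta \phi]^\top \mathbb{E}[\phi \phi^\top]^{-1} \mathbb{E}[\Delta \phi]$ with model-based TD error $\Delta = \hat r + \gamma \mathbf{w}^\top \hat{\mathbf{x}} - \mathbf{w}^\top \phi$; the function names and the synthetic model outputs are hypothetical.

```python
import numpy as np

def mbmspbe(w, phis, x_hats, r_hats, gamma):
    """Sample estimate of E[Delta phi]^T E[phi phi^T]^{-1} E[Delta phi]."""
    n = len(phis)
    delta = r_hats + gamma * x_hats @ w - phis @ w   # model-based TD errors
    g = phis.T @ delta / n                           # estimate of E[Delta phi]
    C = phis.T @ phis / n                            # estimate of E[phi phi^T]
    return g @ np.linalg.solve(C, g)

def mbmspbe_minimizer(phis, x_hats, r_hats, gamma):
    """Solve A w = c with A = mean[phi (phi - gamma x_hat)^T], c = mean[r_hat phi]."""
    n = len(phis)
    A = phis.T @ (phis - gamma * x_hats) / n
    c = phis.T @ r_hats / n
    return np.linalg.solve(A, c)

# Synthetic search-control samples and (assumed) model outputs.
rng = np.random.default_rng(0)
n, d, gamma = 200, 4, 0.9
phis = rng.normal(size=(n, d))
x_hats = 0.5 * rng.normal(size=(n, d))          # stand-in for x_hat(phi_k, A_k)
w_true = rng.normal(size=d)
r_hats = (phis - gamma * x_hats) @ w_true       # rewards consistent with w_true

w_star = mbmspbe_minimizer(phis, x_hats, r_hats, gamma)
loss_at_solution = mbmspbe(w_star, phis, x_hats, r_hats, gamma)
loss_at_zero = mbmspbe(np.zeros(d), phis, x_hats, r_hats, gamma)
```

Because the rewards here are constructed to be consistent with `w_true`, the solver recovers it and the objective is numerically zero there; with a real model the minimizer generally has nonzero TD errors but still zeroes the projected objective when `A` is nonsingular.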
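Proposition 6.1
We note that there is an equivalence between MBMSPBE and MSPBE, where $\text{MSPBE}(\mathbf{w}) = \mathbb{E}_b[\rho_t \delta_t \phi_t]^\top\, \mathbb{E}_b[\phi_t \phi_t^\top]^{-1}\, \mathbb{E}_b[\rho_t \delta_t \phi_t]$, with $\rho_t$ the importance sampling ratio and $\delta_t$ the TD error. That is, if a best nonlinear model is learned from the data generated from some policy $b$, and $\zeta$ in the search-control process equals $b$’s stationary feature vector distribution $\mu$, then MBMSPBE is the MSPBE.
Note that Proposition 6.1 does not hold for the best linear model, for the same reason elaborated in Proposition 5.1.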
Let’s now consider algorithms that can be used to minimize this objective. Consider the gradient of MBMSPBE:
$$\nabla \text{MBMSPBE}(\mathbf{w}) = -2\, \mathbb{E}_\zeta\big[(\phi_k - \gamma \hat{\mathbf{x}}(\phi_k, A_k))\, \phi_k^\top\big]\; \mathbb{E}_\zeta\big[\phi_k \phi_k^\top\big]^{-1}\; \mathbb{E}_\zeta\big[\Delta_k \phi_k\big],$$
where $\Delta_k = \hat r(\phi_k, A_k) + \gamma \mathbf{w}^\top \hat{\mathbf{x}}(\phi_k, A_k) - \mathbf{w}^\top \phi_k$. Note that in the above expression, we have a product of three expectations and, therefore, we cannot simply use one sample to obtain an unbiased estimate of the gradient. In order to obtain this estimate, we could either draw three independent samples or learn the product of the last two factors using a linear least-squares method. GTD methods take the second route, leading to an algorithm with $O(d)$ complexity in which two sets of parameters are mutually related in their updates. However, if one uses a linear model, the computational complexity for storing and using the model is already $O(d^2)$. For a nonlinear model, it can be either smaller or greater than $O(d^2)$ depending on the parameterization choices. Thus, a planning algorithm with $O(d^2)$ complexity can be an acceptable choice. This leads to two choices. The first is to sample the three expectations independently and then combine them to produce an unbiased estimate of the gradient. Note that this would still lead to an $O(d^2)$ algorithm, as the matrix inversion can be done in $O(d^2)$ using the Sherman-Morrison formula. The second choice is to use the linear least-squares method to estimate the first two expectations and sample the last one. In this case, there are still two sets of parameters, but their updates are not mutually dependent. This can potentially lead to faster convergence. Although both of these approaches have $O(d^2)$ complexity, we adopt the second approach, which is summarized in Algorithm 1. We now present the convergence theorem for the proposed algorithm, which is followed by its empirical evaluation. The reader can refer to the supplementary material for the proof of the theorem.
Algorithm 1
Input: $\mathbf{w}_0$, policy $\pi$, feature vector distribution $\zeta$, expectation model $\{\hat{\mathbf{x}}, \hat r\}$, step-sizes $\alpha_k$ for $k \ge 1$
Output: $\mathbf{w}_k$
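The second approach described above (estimate the first two expectations, sample the last one) can be sketched as follows. This is an illustrative two-phase simplification under assumptions and should not be read as the paper's Algorithm 1: the function names, the two-phase structure, the regularizer `eps`, and the toy model in the usage example are all hypothetical. The Sherman-Morrison update keeps the inverse of the feature covariance sum at $O(d^2)$ per sample.

```python
import numpy as np

def sherman_morrison(C_inv, phi):
    """Inverse of (C + phi phi^T) from the inverse of a symmetric C, in O(d^2)."""
    v = C_inv @ phi
    return C_inv - np.outer(v, v) / (1.0 + phi @ v)

def gdp_style_planning(sample_phi, policy, x_hat, r_hat, w0, gamma, rng,
                       n_estimate=2000, n_updates=20000, alpha=0.01, eps=1e-3):
    d = len(w0)
    # Phase 1: estimate the first two factors of the gradient from samples.
    At_sum = np.zeros((d, d))      # sum of (phi - gamma x_hat) phi^T
    C_inv = np.eye(d) / eps        # inverse of (eps I + sum of phi phi^T)
    for _ in range(n_estimate):
        phi = sample_phi(rng)
        a = policy(phi, rng)
        At_sum += np.outer(phi - gamma * x_hat(phi, a), phi)
        C_inv = sherman_morrison(C_inv, phi)
    G = At_sum @ C_inv             # the 1/n factors of the two averages cancel

    # Phase 2: sample the last factor, Delta_k phi_k, and descend the objective.
    w = np.array(w0, dtype=float)
    for _ in range(n_updates):
        phi = sample_phi(rng)
        a = policy(phi, rng)
        delta = r_hat(phi, a) + gamma * w @ x_hat(phi, a) - w @ phi
        w += alpha * delta * (G @ phi)
    return w

# Toy usage with a hypothetical nonlinear expectation model whose rewards are
# consistent with a known weight vector, so the planner should recover it.
rng = np.random.default_rng(0)
d, gamma = 3, 0.9
M = rng.normal(size=(2, d, d))
w_true = rng.normal(size=d)
sample_phi = lambda rng: rng.normal(size=d)           # search-control distribution
policy = lambda phi, rng: int(rng.integers(2))        # uniform over two actions
x_hat = lambda phi, a: 0.3 * np.tanh(M[a] @ phi)
r_hat = lambda phi, a: w_true @ (phi - gamma * x_hat(phi, a))

w_planned = gdp_style_planning(sample_phi, policy, x_hat, r_hat,
                               np.zeros(d), gamma, rng)
```

A fully online variant would interleave the two phases and update both sets of parameters on every planning step; the fixed matrix `G` is used here only to keep the sketch short and easy to reason about.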
7 Experiments
The goal of the experiment section is to validate the theoretical results and investigate how the GDP algorithm performs in practice. Concretely, we seek to answer the following questions: 1) is the proposed planning algorithm stable for the nonlinear model choice, especially when the model is learned online, and 2) what solution does the proposed planning algorithm converge to?
7.1 Divergence in TD(0)-based Planning
Our first experiment is designed to illustrate the divergence issue with the TD(0)-based planning update. We use Baird’s counterexample [Baird1995, Sutton and Barto2018] with the same dynamics and reward function, a classic example that highlights the off-policy divergence problem with the model-free TD(0) algorithm. The policy $b$ used to learn the model is arranged to be the same as the behavior policy in the counterexample, whereas the policy $\pi$ to be evaluated is arranged to be the same as the counterexample’s target policy. For TD(0) with the linear model, we initialize the matrix $\mathbf{F}_a$ and vector $\mathbf{b}_a$ for all $a$ to be zero. For GDP, we use a neural network with one hidden layer of 200 units as the nonlinear model. We initialize the nonlinear model using Xavier initialization [Glorot and Bengio2010]. The parameter $\mathbf{w}$ for the estimated value function is initialized as proposed in the counterexample. The model is learned in an online fashion, that is, we use only the most recent sample to perform a gradient-descent update on the mean-square error. The search-control process is also restricted to generate the last-seen feature vector, which is then used with an action drawn from $\pi$ to simulate the next feature vector. The resulting simulated transition is used to apply the planning update. The evaluation metric is the Root Mean Square Value Error (RMSVE). The results are reported for hyperparameters chosen based on RMSVE over the latter half of a run. In Figure 2, we see that TD(0) updates with the linear expectation model cause the value function to diverge. In contrast, the GDP-based planning algorithm remains sound and converges to an RMSVE of 2.0. Interestingly, stable model-free methods also converge to the same RMSVE value (not shown here) [Sutton and Barto2018].
7.2 Convergence in Practice
In this set of experiments, we want to investigate how the GDP algorithm performs in practice. We evaluate the proposed method for the nonlinear model choice in two simple yet illustrative domains: Four Room [Sutton et al.1999, Ghiassian et al.2018] and a stochastic variant of Mountain Car [Sutton1996]. Similar to the previous experiment, the model is learned online. Search control, however, samples uniformly from the 1000 most recently visited feature vectors to approximate the i.i.d. assumption in Theorem 6.1.
We modified the Four Room domain by changing the states in the four corners to terminal states. The reward is zero everywhere except when the agent transitions into a terminal state, where the reward is one. The episode starts in one of the non-terminal states, chosen uniformly at random. The policy $b$ used to generate the data for learning the model takes all actions with equal probability, whereas the policy $\pi$ to be evaluated constitutes the shortest path to the top-left terminal state and is deterministic. We used tile coding [Sutton1996] to obtain features. In Mountain Car, the policy $b$ used to generate the data is the standard energy-pumping policy with 50% randomness [Le et al.2017], whereas the policy $\pi$ to be evaluated is also the standard energy-pumping policy but with no randomness. We again used tile coding to obtain features. We inject stochasticity into the environment by executing the chosen action only 70% of the time, whereas a random action is executed 30% of the time. In both experiments, we do one planning step for each sample collected by the policy $b$. As noted in Proposition 6.1, if we have $\zeta = \mu$, the minimizer of MBMSPBE for the best nonlinear model is the off-policy LSTD solution [Yu2010]. Therefore, for both domains, we run the off-policy LSTD algorithm for 2 million time-steps and use the resulting solution as the reference for the evaluation metric.
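The action-noise mechanism described for the stochastic Mountain Car variant can be sketched as a small wrapper. This is an assumption-laden illustration, not the authors' environment code: the helper name `noisy_action` is hypothetical, and we assume the random action is drawn uniformly from all actions (so it may coincide with the chosen one).

```python
import numpy as np

def noisy_action(chosen_action, n_actions, rng, p_random=0.3):
    """With probability p_random execute a uniformly random action,
    otherwise execute the action chosen by the policy."""
    if rng.random() < p_random:
        return int(rng.integers(n_actions))
    return chosen_action

# Usage: Mountain Car has three actions; the policy always picks action 1 here.
rng = np.random.default_rng(0)
executed = [noisy_action(1, 3, rng) for _ in range(20000)]
frac_chosen = sum(a == 1 for a in executed) / len(executed)
print(frac_chosen)
```

Under the uniform-draw assumption the chosen action is actually executed with probability 0.7 + 0.3/3 = 0.8; excluding the chosen action from the random draw would give exactly 0.7, and the source text does not say which convention was used.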
The results are reported for hyperparameters chosen according to the LSTD-solution-based loss over the latter half of a run. In Figure 3, we see that GDP remains stable and converges to the off-policy LSTD solution in both domains.
8 Conclusion
In this paper, we proposed a sound way of using expectation models for planning and showed that it is equivalent to planning with distribution models if the state value function is linear in state-features. We made a theoretical argument for nonlinear expectation models to be the parametrization of choice even if the value function is linear. Lastly, we proposed GDP, a model-based policy evaluation algorithm with convergence guarantees, and empirically demonstrated its effectiveness.
9 Acknowledgement
We would like to thank Huizhen Yu, Sina Ghiassian and Banafsheh Rafiee for useful discussions and feedback.
References
 [Baird1995] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier, 1995.
 [Chua et al.2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pages 4759–4770, 2018.
 [Chung et al.2018] Wesley Chung, Somjit Nath, Ajin Joseph, and Martha White. Two-timescale networks for nonlinear value function approximation. 2018.
 [Coulom2006] Rémi Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In International conference on computers and games, pages 72–83. Springer, 2006.
 [Deisenroth and Rasmussen2011] Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465–472, 2011.
 [Ghiassian et al.2018] Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, and Adam White. Online off-policy prediction. CoRR, abs/1811.02597, 2018.
 [Glorot and Bengio2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
 [Jaderberg et al.2016] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 [Konidaris et al.2011] George Konidaris, Sarah Osentoski, and Philip Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Twenty-fifth AAAI conference on artificial intelligence, 2011.
 [Le et al.2017] Lei Le, Raksha Kumaraswamy, and Martha White. Learning sparse representations in reinforcement learning with sparse coding. arXiv preprint arXiv:1707.08316, 2017.
 [Leibfried et al.2016] Felix Leibfried, Nate Kushman, and Katja Hofmann. A deep learning approach for joint video frame and reward prediction in atari games. arXiv preprint arXiv:1611.07078, 2016.
 [Modayil et al.2014] Joseph Modayil, Adam White, and Richard S Sutton. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior, 22(2):146–160, 2014.
 [Moore and Atkeson1993] Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning, 13(1):103–130, 1993.
 [Oh et al.2015] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.
 [Pan et al.2018] Yangchen Pan, Muhammad Zaheer, Adam White, Andrew Patterson, and Martha White. Organizing experience: a deeper look at replay mechanisms for sample-based planning in continuous state domains. arXiv preprint arXiv:1806.04624, 2018.
 [Parr et al.2008] Ronald Parr, Lihong Li, Gavin Taylor, Christopher Painter-Wakefield, and Michael L Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pages 752–759. ACM, 2008.
 [Sutton and Barto2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. 2018.
 [Sutton et al.1999] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
 [Sutton et al.2009] Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 993–1000. ACM, 2009.
 [Sutton et al.2011] Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
 [Sutton et al.2012] Richard S Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael P Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285, 2012.
 [Sutton1991] R.S. Sutton. Integrated modeling and control based on reinforcement learning and dynamic programming. In Advances in Neural Information Processing Systems, 1991.
 [Sutton1996] Richard S Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in neural information processing systems, pages 1038–1044, 1996.
 [White and others2015] Adam White et al. Developing a predictive approach to knowledge. 2015.
 [Yu2010] Huizhen Yu. Convergence of least squares temporal difference methods under general conditions. In ICML, pages 1207–1214, 2010.