1 Introduction
Model-based reinforcement learning (RL) has proven to be a powerful approach for generating reward-seeking behavior in sequential decision-making environments. For example, a number of methods are known for guaranteeing near-optimal behavior in a Markov decision process (MDP) by adopting a model-based approach (Kearns & Singh, 1998; Brafman & Tennenholtz, 2002; Strehl et al., 2009). In this line of work, a learning agent continually updates its model of the transition dynamics of the environment and actively seeks out parts of its environment that could contribute to achieving high reward but that are not yet well learned. Policies, in this setting, are designed specifically to explore unknown transitions so that the agent will be able to exploit (that is, maximize reward) in the long run.
A distinct model-based RL problem is one in which an agent has explored its environment, constructed a model, and must then use this learned model to select the best policy that it can. A straightforward approach to this problem, referred to as the certainty equivalence approximation (Dayan & Sejnowski, 1996), is to take the learned model and to compute its optimal policy, deploying the resulting policy in the real environment. The promise of such an approach is that, for environments that are defined by relatively simple dynamics but require complex behavior, a model-based learner can start making high-quality decisions with little data.
Nevertheless, recent large-scale successes of reinforcement learning have not been due to model-based methods but instead derive from value-function-based or policy-search methods (Mnih et al., 2015, 2016; Schulman et al., 2017; Hessel et al., 2018). Attempts to leverage model-based methods have fallen below expectations, particularly when models are learned using function-approximation methods. Jiang et al. (2015) highlighted a significant shortcoming of the certainty equivalence approximation, showing that it is important to hedge against possibly misleading errors in a learned model. They found that reducing the effective planning depth by decreasing the discount factor used for decision making can result in improved performance when operating in the true environment.
At first, this result might seem counterintuitive: the best way to exploit a learned model can be to exploit it incompletely. However, an analogous situation arises in supervised machine learning. It is well established that, particularly when data is sparse, the representational capacity of supervised learning methods must be restrained or regularized to avoid overfitting. Returning the best hypothesis in a hypothesis class relative to the training data can be problematic if the hypothesis class is overly expressive relative to the size of the training data. The classic result is that testing performance improves, plateaus, then drops as the complexity of the learner’s hypothesis class is increased.
In this paper, we extend the results on avoiding planner overfitting via decreasing discount rates by introducing several other ways of regularizing policies in model-based RL. In each case, we see the classic “overfitting” pattern: resisting the urge to treat the learned model as correct, and instead searching in a reduced policy class, is repaid by improved performance in the actual environment. We believe this research direction may hold the key to large-scale applications of model-based RL.
Section 2 provides the definitions that serve as a vocabulary for the paper. Section 3 reviews the results on decreasing discount rates, Section 4 presents a new approach that plans using epsilon-greedy policies, and Section 5 presents results where policy search is performed using lower-capacity representations of policies. Section 6 summarizes related work and Section 7 concludes.
2 Definitions
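An MDP $M$ is defined by the quantities $\langle S, A, R, T, \gamma \rangle$, where $S$ is a state space, $A$ is an action space, $R : S \times A \to \mathbb{R}$ is a reward function, $T : S \times A \to \mathcal{P}(S)$ is a transition function, and $0 \le \gamma < 1$ is a discount factor. The notation $\mathcal{P}(X)$ represents the set of probability distributions over the discrete set $X$, and we write $T(s' \mid s, a)$ for the probability that $T(s, a)$ assigns to $s'$. Given an MDP $M = \langle S, A, R, T, \gamma \rangle$, its optimal value function $Q^*$ is the solution to the Bellman equation:
$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \max_{a'} Q^*(s', a').$$
This function is unique and can be computed by algorithms such as value iteration or linear programming (Puterman, 1994).
A (deterministic) policy is a mapping from states to actions, $\pi : S \to A$. Given a value function $Q$, the greedy policy with respect to $Q$ is $\pi_Q(s) = \arg\max_a Q(s, a)$. The greedy policy with respect to $Q^*$ maximizes expected discounted reward from all states. We assume that ties between actions of the greedy policy are broken arbitrarily but consistently so there is always a unique optimal policy for any MDP.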
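As a minimal sketch of these definitions, the following runs value iteration on a tabular MDP and extracts the greedy policy. The two-state MDP, the function names, and the fixed iteration count are illustrative assumptions, not taken from the paper.

```python
def value_iteration(R, T, gamma, iters=500):
    """Apply the Bellman optimality backup `iters` times.

    R[s][a] is the reward and T[s][a][s2] the transition probability.
    Returns the table Q[s][a].
    """
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [max(q) for q in Q]  # value of acting greedily in each state
        Q = [[R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q


def greedy_policy(Q):
    """Greedy policy; ties are broken consistently (lowest action index)."""
    return [max(range(len(q)), key=lambda a: q[a]) for q in Q]


# Hypothetical MDP: action 0 leads to state 0, action 1 leads to state 1,
# and only state 1 pays a reward.
R = [[0.0, 0.0], [1.0, 1.0]]
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
Q = value_iteration(R, T, gamma=0.9)
policy = greedy_policy(Q)
```

In this toy problem the greedy policy moves to state 1 and stays there, so the value of state 1 is 1/(1 - 0.9) = 10.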
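The value function for a policy $\pi$ deployed in $M$ can be found by solving
$$Q^\pi_M(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a)\, Q^\pi_M(s', \pi(s')).$$
The value function of the optimal policy is the optimal value function. For a policy $\pi$, we also define the scalar $V^\pi_M = \sum_s w_s\, Q^\pi_M(s, \pi(s))$, where $w$ is an MDP-specific weighting function over the states.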
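The epsilon-greedy policy (Sutton & Barto, 1998) is a stochastic policy where the probability of choosing action $a$ in state $s$ is $1 - \epsilon + \epsilon/|A|$ if $a = \arg\max_{a'} Q(s, a')$ and $\epsilon/|A|$ otherwise. The optimal epsilon-greedy policy for $M$ is not generally the epsilon-greedy policy for $Q^*$. Instead, it is necessary to solve a different set of Bellman equations:
$$Q^\epsilon(s, a) = R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) \Big[ (1 - \epsilon) \max_{a'} Q^\epsilon(s', a') + \frac{\epsilon}{|A|} \sum_{a'} Q^\epsilon(s', a') \Big].$$
The optimal epsilon-greedy policy plays an important role in the analysis of learning algorithms like SARSA (Rummery, 1994; Littman & Szepesvári, 1996).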
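One way to read the modified Bellman equations is that the backup replaces the max over next actions with a mixture of the max and the uniform average. The sketch below assumes that form; the function name and the toy MDP are hypothetical.

```python
def eps_greedy_value_iteration(R, T, gamma, eps, iters=500):
    """Value iteration for the optimal epsilon-greedy policy.

    Assumed backup: the next-state value is
    (1 - eps) * max_a Q(s', a) + eps * mean_a Q(s', a).
    """
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [(1 - eps) * max(q) + eps * sum(q) / n_a for q in Q]
        Q = [[R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q


# Hypothetical MDP: action 0 leads to state 0, action 1 leads to state 1,
# and only state 1 pays a reward.
R = [[0.0, 0.0], [1.0, 1.0]]
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
Q_plain = eps_greedy_value_iteration(R, T, gamma=0.9, eps=0.0)
Q_soft = eps_greedy_value_iteration(R, T, gamma=0.9, eps=0.2)
```

With `eps=0.0` the backup reduces to ordinary value iteration. With `eps=0.2` every value drops, because the agent must plan around its own occasional random actions.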
These examples of optimal policies are with respect to all possible deterministic Markov policies. In this paper, we also consider optimization with respect to a restricted set of policies $\Pi$. The optimal restricted policy $\rho^*$ can be found by comparing the scalar values of the policies:
$$\rho^* = \arg\max_{\pi \in \Pi} V^\pi_M.$$
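For small tabular problems, optimizing over a restricted class can be sketched by brute force: evaluate each candidate policy, reduce its value function to a scalar with the state weighting, and keep the best. The helper names, uniform weighting, and toy MDP below are assumptions for illustration.

```python
from itertools import product


def policy_value(policy, R, T, gamma, w, iters=500):
    """Weighted scalar value of a deterministic policy (iterative evaluation)."""
    n_s = len(R)
    V = [0.0] * n_s
    for _ in range(iters):
        V = [R[s][policy[s]]
             + gamma * sum(T[s][policy[s]][s2] * V[s2] for s2 in range(n_s))
             for s in range(n_s)]
    return sum(w[s] * V[s] for s in range(n_s))


def best_in_class(policies, R, T, gamma, w):
    """Return the policy in `policies` with the highest scalar value."""
    return max(policies, key=lambda pi: policy_value(pi, R, T, gamma, w))


# Hypothetical MDP: action 0 leads to state 0, action 1 leads to state 1,
# and only state 1 pays a reward.
R = [[0.0, 0.0], [1.0, 1.0]]
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
w = [0.5, 0.5]  # uniform state weighting (an assumption)

all_policies = list(product(range(2), repeat=2))
restricted = [(0, 0), (0, 1)]  # a class that must choose action 0 in state 0

best_all = best_in_class(all_policies, R, T, 0.9, w)
best_restricted = best_in_class(restricted, R, T, 0.9, w)
```

Here the restricted class excludes the overall optimum, so its best member has a lower scalar value. That gap is the price of restricting the search.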
3 Decreased Discounting
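Let $M = \langle S, A, R, T, \gamma \rangle$ be the evaluation environment and $\hat{M} = \langle S, A, R, \hat{T}, \hat{\gamma} \rangle$ be the planning environment, where $\hat{T}$ is the learned model and $\hat{\gamma} \le \gamma$ is a smaller discount factor used to decrease the effective planning horizon.
Jiang et al. (2015) proved a bound on the difference between the performance of the optimal policy in $M$ and the performance of the optimal policy in $\hat{M}$ when executed in $M$:
$$V^{\pi^*_{M}}_{M} - V^{\pi^*_{\hat{M}}}_{M} \le \frac{\gamma - \hat{\gamma}}{(1 - \gamma)(1 - \hat{\gamma})} R_{\max} + \frac{2 R_{\max}}{(1 - \hat{\gamma})^2} \sqrt{\frac{1}{2n} \log \frac{2 |S| |A| |\Pi_{R,\hat{\gamma}}|}{\delta}}. \qquad (1)$$
Here, $R_{\max}$ is the largest reward (we assume all rewards are nonnegative), $\delta$ is the certainty with which the bound needs to hold, $n$ is the number of samples of each transition used to build the model, and $|\Pi_{R,\hat{\gamma}}|$ is the number of distinct possibly optimal policies for $\langle S, A, R, \cdot, \hat{\gamma} \rangle$ over the entire space of possible transition functions.
They show that $|\Pi_{R,\hat{\gamma}}|$ is an increasing function of $\hat{\gamma}$, growing from 1 to as high as $|A|^{|S|}$, the size of the set of all possible deterministic policies. They left open the shape of this function, which is most useful if it grows gradually, but could possibly jump abruptly.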
To help ground intuitions, we estimated how the number of distinct possibly optimal policies grows with the planning discount factor over a set of randomly generated MDPs. Following Jiang et al. (2015), a “ten-state chain” MDP is drawn such that, for each state–action pair $(s, a)$, the transition function $T(s, a)$ is constructed by choosing states at random from $S$, then assigning probabilities to these states by drawing one independent sample per chosen state from a uniform distribution and normalizing the resulting numbers. The probability of transition to any other state is zero. For each state–action pair $(s, a)$, the reward $R(s, a)$ is drawn from a uniform distribution. For our MDPs, we chose , and . We examined in , computed optimal policies by running value iteration with iterations. We sampled repeatedly until no new optimal policy was discovered for consecutive samples.
Figure 1 is an estimate of how the number of possibly optimal policies grows in this class of randomly generated MDPs. Fortunately, the set appears to grow gradually, making the planning discount factor an effective parameter for fighting planner overfitting.
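The sampling procedure described above can be sketched as follows. The concrete sizes (10 states, 2 actions, 5 successor states) and the unit interval for the uniform draws are assumptions in the style of Jiang et al. (2015), not values confirmed by this text.

```python
import random


def random_mdp(n_states=10, n_actions=2, n_successors=5, rng=None):
    """Sample a random tabular MDP; all default sizes are assumed values."""
    rng = rng or random.Random(0)
    T = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    R = [[0.0] * n_actions for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            # Pick the successor states, then give them normalized random weights.
            successors = rng.sample(range(n_states), n_successors)
            weights = [rng.random() for _ in successors]
            total = sum(weights)
            for s2, weight in zip(successors, weights):
                T[s][a][s2] = weight / total
            # Reward drawn uniformly from [0, 1) (assumed support).
            R[s][a] = rng.random()
    return R, T


R, T = random_mdp()
```

Every other transition probability stays at zero, so each row of `T` has at most `n_successors` nonzero entries.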
Estimating , Figure 2 shows the bound of Equation 1 applied to the random MDP distribution (, , , ).
Note that the expected “U” shape is visible, but only for a relatively narrow range of values of . For under k samples, the minimal loss bound is achieved for . For over k samples, the minimal loss bound is achieved for . (Note that the pattern shown here is relatively insensitive to the estimated shape of .)
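To see where a “U” shape can come from, the sketch below evaluates a two-term bound of the kind stated by Jiang et al. (2015): a bias term that shrinks as the planning discount approaches the evaluation discount, plus an estimation term that grows with it. The exact form used here is recalled from that paper and should be read as an assumption. The policy count is held at its worst case and all numeric settings are hypothetical.

```python
import math


def loss_bound(gamma_plan, gamma_eval, n, n_states, n_actions,
               n_policies, r_max=1.0, delta=0.1):
    """Two-term planning-loss bound (assumed form, after Jiang et al., 2015)."""
    bias = ((gamma_eval - gamma_plan)
            / ((1 - gamma_eval) * (1 - gamma_plan)) * r_max)
    estimation = (2 * r_max / (1 - gamma_plan) ** 2) * math.sqrt(
        math.log(2 * n_states * n_actions * n_policies / delta) / (2 * n))
    return bias + estimation


# Hypothetical settings: 10 states, 2 actions, worst-case policy count 2**10.
grid = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
bounds = {g: loss_bound(g, 0.99, n=10_000, n_states=10, n_actions=2,
                        n_policies=2 ** 10) for g in grid}
best = min(grid, key=lambda g: bounds[g])
```

Under these assumed settings the minimizing planning discount lies strictly inside the grid rather than at either end. This is the qualitative pattern the text describes.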
For actual MDPs, the “U” shape is much more robust. Using this same distribution over MDPs, Figure 3 replicates an empirical result of Jiang et al. (2015) showing that intermediate values of the planning discount factor $\hat{\gamma}$ are most successful and that this value grows as the model used in planning becomes more accurate (having been trained on more trajectories). We sampled MDPs from the same random distribution and, for each value of , we generated datasets each consisting of trajectories of length starting from a state selected uniformly at random and executing a random policy. In all experiments, the estimated MDP ($\hat{M}$) was computed using maximum likelihood estimates of $T$ and $R$ with no additive Gaussian noise. Optimal policies were all found by running value iteration in the estimated MDP $\hat{M}$. The empirical loss (Equation 14 of Jiang et al. (2015)) was computed for each value of $\hat{\gamma}$. The error bars shown in the figure represent
4 Increased Exploration
In this section, we consider a novel regularization approach in which planning is performed over the set of epsilon-greedy policies. The intuition here is that adding noise to the policies makes it harder for them to be tailored explicitly to the learned model, resulting in less planner overfitting.
Section 4.1 introduces a general bound, and Section 4.2 applies it to the set of epsilon-greedy policies.
4.1 General Bounds
We can relate the structure of a restricted set of policies to the performance in an approximate model with the following theorem.
Theorem 1.
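Let $\Pi$ be a set of policies for an MDP $M = \langle S, A, R, T, \gamma \rangle$. Let $\hat{M} = \langle S, A, R, \hat{T}, \gamma \rangle$ be an MDP like $M$, but with a different transition function. Let $\pi^*$ be the optimal policy for $M$ and $\hat{\pi}^*$ be the optimal policy for $\hat{M}$. Let $\rho^*$ be the optimal policy in $\Pi$ for $M$ and $\hat{\rho}^*$ be the optimal policy in $\Pi$ for $\hat{M}$. Then,
$$\big| V^{\pi^*}_{M} - V^{\hat{\rho}^*}_{M} \big| \le \big| V^{\pi^*}_{M} - V^{\rho^*}_{M} \big| + 2 \max_{\pi \in \Pi} \big| V^{\pi}_{M} - V^{\pi}_{\hat{M}} \big|.$$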
Proof.
Theorem 1 shows that the restricted policy set $\Pi$ impacts the resulting value of the plan in two ways. First, the bigger the class $\Pi$ is, the closer its best member $\rho^*$ becomes to the unrestricted optimum $\pi^*$; that is, the more policies we consider, the closer to optimal we become. At the same time, the largest gap $\max_{\pi \in \Pi} |V^{\pi}_{M} - V^{\pi}_{\hat{M}}|$ grows as $\Pi$ gets larger, as there are more policies that can differ in value between $M$ and $\hat{M}$.
Jiang et al. (2015) leverage this structure in the specific case of defining $\Pi$ by optimizing policies using a smaller value for $\gamma$. Our Theorem 1 generalizes the idea to arbitrary restricted policy classes and arbitrary pairs of MDPs $M$ and $\hat{M}$.
In particular, consider a sequence of policy classes $\Pi_1, \Pi_2, \ldots$ such that $\Pi_i \subseteq \Pi_{i+1}$. Then, the first part of the bound is monotonically nonincreasing (it goes down each time a better policy is included in the set) and the second part of the bound is monotonically nondecreasing (it goes up each time a policy is included that magnifies the difference in performance possible in the two MDPs).
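A two-part bound of this kind can be motivated by a standard telescoping argument, sketched here under the notation assumed above: $V^{\pi}_{M}$ is the scalar value of $\pi$ in $M$, and $\rho^*$ and $\hat{\rho}^*$ are the best policies in $\Pi$ for $M$ and $\hat{M}$. This is an illustrative derivation and not necessarily the authors' own proof.

```latex
\begin{align*}
V^{\pi^*}_{M} - V^{\hat{\rho}^*}_{M}
 &= \bigl(V^{\pi^*}_{M} - V^{\rho^*}_{M}\bigr)
  + \bigl(V^{\rho^*}_{M} - V^{\rho^*}_{\hat{M}}\bigr)
  + \bigl(V^{\rho^*}_{\hat{M}} - V^{\hat{\rho}^*}_{\hat{M}}\bigr)
  + \bigl(V^{\hat{\rho}^*}_{\hat{M}} - V^{\hat{\rho}^*}_{M}\bigr) \\
 &\le \bigl(V^{\pi^*}_{M} - V^{\rho^*}_{M}\bigr)
  + 2 \max_{\pi \in \Pi} \bigl| V^{\pi}_{M} - V^{\pi}_{\hat{M}} \bigr| .
\end{align*}
```

The third term is at most zero because $\hat{\rho}^*$ maximizes value in $\hat{M}$ over $\Pi$ and $\rho^* \in \Pi$. The second and fourth terms each compare one policy from $\Pi$ across the two MDPs, so each is at most the maximum gap over $\Pi$.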
In Lemma 1, we show that the particular choice of $\hat{M}$ that comes from statistically sampling transitions as in certainty equivalence leads to a bound on $|V^{\pi}_{M} - V^{\pi}_{\hat{M}}|$, for an arbitrary policy $\pi$.
Lemma 1.
Given true MDP $M$, let $\hat{M}$ be an MDP comprising a reward function $R$ and transition function $\hat{T}$ estimated from $n$ samples for each state–action pair, and let $\pi$ be a policy. Then the following holds with probability at least $1 - \delta$:
Proof.
This lemma is a variation of the classic “Simulation Lemma” (Kearns & Singh, 1998; Strehl et al., 2009) and is proven in this form as Theorem 2 of Jiang et al. (2015). Note that their proof, though stated with respect to a particular choice of set, holds in this general form. ∎
4.2 Bound for Epsilon-Greedy Policies
It remains to show that the gap between the unrestricted optimum and the best policy in the class, $|V^{\pi^*}_{M} - V^{\rho^*}_{M}|$, is bounded when the class is restricted to epsilon-greedy policies. For the case of planning with a decreased discount factor, Jiang et al. (2015) provide a bound for this quantity in their Lemma 1. For the case of epsilon-greedy policies, the corresponding bound is proven in the following lemma.
Lemma 2.
For any MDP $M$, the difference in value of the optimal policy $\pi^*$ and the optimal $\epsilon$-greedy policy $\rho^*$ is bounded by:
Proof.
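Let $\pi^*$ be the optimal policy for $M$ and $\pi_u$ be a policy that selects actions uniformly at random. We can define $\pi_\epsilon$, an $\epsilon$-greedy version of $\pi^*$, as:
$$\pi_\epsilon(a \mid s) = (1 - \epsilon)\, \pi^*(a \mid s) + \epsilon\, \pi_u(a \mid s), \qquad (4)$$
where $\pi(a \mid s)$ refers to the probability associated with action $a$ under a policy $\pi$. Let $T^{\pi}$ denote the transition matrix from states to states under policy $\pi$. Using the above definition, we can decompose the transition matrix into
$$T^{\pi_\epsilon} = (1 - \epsilon)\, T^{\pi^*} + \epsilon\, T^{\pi_u}. \qquad (5)$$
Similarly, we have for the reward vector over states,
$$R^{\pi_\epsilon} = (1 - \epsilon)\, R^{\pi^*} + \epsilon\, R^{\pi_u}. \qquad (6)$$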
To obtain our bound, note
(7)  
Since is a transition matrix, all its entries lie in ; hence, we have the following elementwise matrix inequality:
(8) 
Plugging inequality 8 into the bound 7 results in
Since we can upper bound the norm of difference of the values vector over states with
Using this inequality, we can bound the difference in value of the optimal policy and the optimal $\epsilon$-greedy policy by
∎
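The decomposition used in the proof can be checked numerically on a toy problem: mix the state-to-state transition matrix and reward vector of an optimal policy with those of the uniform random policy, then evaluate the mixture. The matrices below belong to a hypothetical two-state MDP. The mixed policy is the epsilon-greedy version of the optimal policy, which need not be the optimal epsilon-greedy policy in general.

```python
def evaluate(T_pi, R_pi, gamma, iters=2000):
    """Iteratively solve V = R_pi + gamma * T_pi V for a fixed policy."""
    n = len(R_pi)
    V = [0.0] * n
    for _ in range(iters):
        V = [R_pi[s] + gamma * sum(T_pi[s][s2] * V[s2] for s2 in range(n))
             for s in range(n)]
    return V


def mix(A, B, eps):
    """Elementwise (1 - eps) * A + eps * B for vectors or matrices."""
    if isinstance(A[0], list):
        return [mix(a, b, eps) for a, b in zip(A, B)]
    return [(1 - eps) * a + eps * b for a, b in zip(A, B)]


# Hypothetical two-state example: the optimal policy always moves to state 1,
# the uniform policy moves to either state with probability 1/2.
T_star, R_star = [[0.0, 1.0], [0.0, 1.0]], [0.0, 1.0]
T_unif, R_unif = [[0.5, 0.5], [0.5, 0.5]], [0.0, 1.0]

gamma, eps = 0.9, 0.2
V_star = evaluate(T_star, R_star, gamma)
V_eps = evaluate(mix(T_star, T_unif, eps), mix(R_star, R_unif, eps), gamma)
gap = [vs - ve for vs, ve in zip(V_star, V_eps)]
```

The gap is nonnegative in every state and shrinks to zero as `eps` goes to zero, which is the behavior the lemma quantifies.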
Figure 4 is an estimate of how the number of distinct possibly optimal epsilon-greedy policies grows over the class of randomly generated MDPs. Again, the set appears to grow gradually as $\epsilon$ decreases, making $\epsilon$ another effective parameter for fighting planner overfitting.
4.3 Empirical Results
We evaluated this exploration-based regularization approach in the distribution over MDPs used in Figure 3. Figure 5 shows results for each value of $\epsilon$. Here, the maximum likelihood transition function $\hat{T}$ was replaced with the epsilon-softened transition function $\hat{T}_\epsilon$. In contrast to the previous figure, regularization increases as we go to the right. Once again, we see that intermediate values of $\epsilon$ are most successful and the best value of $\epsilon$ decreases as the model used in planning becomes more accurate (having been trained on more trajectories). The similarity to Figure 3 is striking: in spite of the difference in approach, it is essentially the mirror image of Figure 3.
We see that manipulating either the planning discount factor $\hat{\gamma}$ or the exploration rate $\epsilon$ can be used to modulate the impact of planner overfitting. Which method to use in practice depends on the particular planner being used and how easily it is modified to use these methods.
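One plausible reading of an epsilon-softened transition function is that, with probability epsilon, the chosen action is replaced by a uniformly random one. The sketch below builds such a softened model under that assumption. The exact definition used in the experiments is not recoverable from this text, and the function name is hypothetical.

```python
def soften(T, eps):
    """Return T_eps[s][a][s2] = (1 - eps) * T[s][a][s2] + eps * mean_a' T[s][a'][s2].

    Assumed definition: with probability eps the chosen action is replaced
    by a uniformly random action.
    """
    n_s, n_a = len(T), len(T[0])
    softened = []
    for s in range(n_s):
        # Next-state distribution of the uniform random action in state s.
        avg = [sum(T[s][a][s2] for a in range(n_a)) / n_a for s2 in range(n_s)]
        softened.append([[(1 - eps) * T[s][a][s2] + eps * avg[s2]
                          for s2 in range(n_s)] for a in range(n_a)])
    return softened


# Hypothetical deterministic two-state model.
T_hat = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [0.0, 1.0]]]
T_soft = soften(T_hat, eps=0.2)
```

A standard planner can then be run unchanged on the softened model, which is one reason this form of regularization is easy to retrofit.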
5 Decreased Policy Complexity
In addition to indirectly controlling policy complexity via the planning discount factor and the exploration rate, it is possible to control complexity via the representation of the policy itself. In this section, we look at varying the complexity of the policy in the context of model-based RL in which a model is learned and then a policy for that model is optimized via a policy search approach. Such an approach was used in the setting of helicopter control (Ng et al., 2003): data collected in that work was used to build a model, a policy was constructed to optimize performance in this model (via policy search, in this case), and the policy was then deployed in the environment.
Our test domain was Lunar Lander, an environment with a continuous state space and discrete actions. The goal of the environment is to control a falling spacecraft so as to land gently in a target area. The state consists of 8 variables, namely the lander’s $x$ and $y$ coordinates, $x$ and $y$ velocities, angle and angular velocity, and two Boolean flags corresponding to whether each leg has touched down. The agent can take 4 actions, corresponding to which of its three thrusters (or no thruster) is active during the current time step. The Lunar Lander environment is publicly available as part of the OpenAI Gym Toolkit (Brockman et al., 2016).
We collected k step episodes of data on Lunar Lander. During data collection, decisions were made by a policy-gradient algorithm. Specifically, we ran the REINFORCE algorithm with the state–value function as the baseline (Williams, 1992; Sutton et al., 2000). For the policy and value networks, we used a single-hidden-layer neural network with 16 hidden units and ReLU activation functions. We used the Adam algorithm (Kingma & Ba, 2014) with the default parameters and a step size of . The learned model was a 3-layer neural net with ReLU activation functions mapping the agent’s state (8 inputs corresponding to 8 state variables) as well as a one-hot representation of actions (4 inputs corresponding to 4 possible actions). The model consisted of two fully connected hidden layers with 32 units each and ReLU activations. We again used Adam and used step size to learn the model.
We then ran policy-gradient RL (REINFORCE) as a planner using the learned model. The policy was represented by a neural network with a single hidden layer. To control the complexity of the policy representation, we varied the number of units in the hidden layer from to . Results were averaged over runs. Figure 6 shows that increasing the size of the hidden layer in the policy resulted in steadily better performance on the learned model (top line). However, after 250 or so units, the resulting policy performed less well on the actual environment (bottom line). Thus, we see that reducing policy complexity serves as yet another way to reduce planner overfitting.
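Two details of this setup are easy to make concrete: how a state and a discrete action are combined into the 12-dimensional model input, and how the hidden-layer width translates into policy capacity. The helper names are hypothetical, and the parameter count assumes a fully connected network with bias terms and one output per action.

```python
def model_input(state, action, n_actions=4):
    """Concatenate the 8 state variables with a one-hot encoding of the action."""
    one_hot = [1.0 if i == action else 0.0 for i in range(n_actions)]
    return list(state) + one_hot


def policy_param_count(hidden_units, n_inputs=8, n_actions=4):
    """Weights plus biases of a single-hidden-layer policy network (assumed layout)."""
    return (n_inputs + 1) * hidden_units + (hidden_units + 1) * n_actions


x = model_input([0.0] * 8, action=2)
sizes = {h: policy_param_count(h) for h in (16, 250)}
```

Under these assumptions the parameter count grows linearly in the number of hidden units, so sweeping the width gives a direct knob on the size of the policy class being searched.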
6 Related Work
Prior work has explored the use of regularization in reinforcement learning to mitigate overfitting. We survey some of the previous methods according to which function is regularized: (1) value, (2) model, or (3) policy.
6.1 Regularizing Value Functions
Many prior approaches have applied regularization to value function approximation, including Least Squares Temporal Difference learning (Bradtke & Barto, 1996), Policy Evaluation, and the batch approach of Fitted Q-Iteration (FQI) (Ernst et al., 2005).
Kolter & Ng (2009) applied regularization techniques to LSTD (Bradtke & Barto, 1996) with an algorithm they called LARS-TD. In particular, they argued that, without regularization, LSTD’s performance depends heavily on the number of basis functions chosen and the size of the data set collected. If the data set is too small, the technique is prone to overfitting. They showed that $\ell_1$ and $\ell_2$ regularization yield a procedure that inherits the benefits of selecting good features while making it possible to compute the fixed point. Later work by Liu et al. (2012) built on this work with the algorithm RO-TD, an $\ell_1$-regularized off-policy temporal difference learning method. Johns et al. (2010) cast the $\ell_1$-regularized fixed-point computation as a linear complementarity problem, which provides stronger solution-uniqueness guarantees than those provided for LARS-TD. Petrik et al. (2010) examined the approximate linear programming (ALP) framework for finding approximated value functions in large MDPs. They showed the benefits of adding an $\ell_1$ regularization constraint to the ALP that increases the error bound at training time and helps fight overfitting.
Farahmand et al. (2008a) and Farahmand et al. (2009) focused on $\ell_2$ regularization applied to Policy Iteration and Fitted Q-Iteration (FQI) (Ernst et al., 2005) and developed two related methods for Regularized Policy Iteration, each leveraging $\ell_2$ regularization during the evaluation of policies for each iteration. The first method adds a regularization term to the Least Squares Temporal Difference (LSTD) error (Bradtke & Barto, 1996), while the second adds a similar term to the optimization of Bellman residual minimization (Baird et al., 1995; Schweitzer & Seidmann, 1985; Williams & Baird, 1993) with $\ell_1$ regularization (Loth et al., 2007). Their main result shows finite convergence for the $Q$ function under the approximated policy and the true optimal policy. A method for FQI adds a regularization cost to the least squares regression of the $Q$ function. Follow-up work (Farahmand et al., 2008b) expanded Regularized Fitted Q-Iteration to planning. That is, given a data set $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ and a function family $\mathcal{F}$ (like regression trees), FQI approximates a $Q$ function through repeated iterations of the following regression problem:
$$Q^{k+1} = \arg\min_{Q \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} \Big( r_i + \gamma \max_{a'} Q^{k}(s'_i, a') - Q(s_i, a_i) \Big)^2 + \lambda\, \mathrm{Pen}(Q),$$
where $\lambda\, \mathrm{Pen}(Q)$ imposes a regularization penalty term and $\lambda$ is a regularization coefficient. They prove bounds relating this regularization cost to the approximation error in $Q$ between iterations of FQI.
Farahmand & Szepesvári (2011) and Farahmand (2011) focused on a problem relevant to our approach—regularization for value selection in RL and planning. They considered an offline setting in which an algorithm, given a data set of experiences and set of possible functions, must choose a function from the set that minimizes the true Bellman error. They provided a general complexity regularization bound for model selection, which they applied to bound the approximation error for the function chosen by their proposed algorithm, BErMin.
6.2 Regularizing Models
In model-based RL, regularization can be used to improve estimates of the reward and transition functions when data is finite or limited.
Taylor & Parr (2009) investigated the relationship between Kernelized LSTD (Xu et al., 2005) and other related techniques, with a focus on regularization in model-based RL. Most relevant to our work is their decomposition of the Bellman error into transition and reward error, which they empirically show offers insight into the choice of regularization parameters.
Bartlett & Tewari (2009) developed an algorithm, Regal, with optimal regret for weakly communicating MDPs. Regal heavily relies on regularization; based on all prior experience, the algorithm continually updates a set $\mathcal{M}$ that, with high probability, contains the true MDP. Letting $\lambda^*(M)$ denote the optimal per-step reward of the MDP $M$, the traditional optimistic exploration tactic would suggest that the agent should choose the $M$ in $\mathcal{M}$ with maximal $\lambda^*(M)$. Regal also adds a regularization term to this maximization to prevent overfitting based on the experiences so far, resulting in state-of-the-art regret bounds.
6.3 Regularizing Policies
The focus of applying regularization to policies is to limit the complexity of the policy class being searched in the planning process. It is this approach that we adopt in the present paper.
Somani et al. (2013) explored how regularization can help online planning for Partially Observable Markov Decision Processes (POMDPs). They introduced the Despot algorithm (Determinized Sparse Partially Observable Tree), which constructs a tree that models the execution of all policies on a number of sampled scenarios (rollouts). However, the authors note that Despot typically succumbs to overfitting, as a policy that performs well on the sampled scenarios is not likely to perform well in general. The work proposes a regularized extension of Despot, R-Despot, where regularization takes the form of balancing between the performance of the policy on the samples with the complexity of the policy class. Specifically, R-Despot imposes a regularization penalty on the utility of each node in the belief tree. The algorithm then computes the policy that maximizes regularized utility for the tree using a bottom-up dynamic programming procedure on the tree. The approach is similar to ours in that it also limits policy complexity through regularization, but focuses on regularizing utility instead of regularizing the use of a transition model. Investigating the interplay between these two approaches poses an interesting direction for future work. In a similar vein, Thomas et al. (2015) developed a batch RL algorithm with a probabilistic performance guarantee that limits the complexity of the policy class as a means of regularization.
Petrik & Scherrer (2008) conducted analysis similar to Jiang et al. (2015). Specifically, they investigated the situations in which using a lower-than-actual discount factor can improve solution quality given an approximate model, noting that this procedure has the effect of regularizing rewards. The work also advanced the first bounds on the error of using a smaller discount factor.
7 Conclusion
For three different regularization methods (decreased discounting, increased exploration, and decreased policy complexity), we found a consistent U-shaped tradeoff between the size of the policy class being searched and its performance on a learned model. Future work will evaluate other methods such as dropout and early stopping.
The plots that varied the planning discount factor $\hat{\gamma}$ and the exploration rate $\epsilon$ were quite similar, raising the possibility that epsilon-greedy action selection is functioning as another way to decrease the effective horizon depth used in planning: chaining together random actions makes future states less predictable, so they carry less weight. Later work can examine whether jointly choosing $\hat{\gamma}$ and $\epsilon$ is more effective than setting only one at a time.
More work is needed to identify methods that can learn in much larger domains (Bellemare et al., 2013). One concept worth considering is adapting regularization nonuniformly to the state space. That is, it should be possible to modulate the complexity of policies considered in parts of the state space where the model is more accurate, allowing more expressive plans in some places than others.
References
 Baird et al. (1995) Baird, Leemon et al. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the twelfth international conference on machine learning, pp. 30–37, 1995.

 Bartlett & Tewari (2009) Bartlett, Peter L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 35–42, 2009.
 Bellemare et al. (2013) Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
 Bradtke & Barto (1996) Bradtke, Steven J and Barto, Andrew G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1–3):33–57, 1996.
 Brafman & Tennenholtz (2002) Brafman, Ronen I. and Tennenholtz, Moshe. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.
 Brockman et al. (2016) Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym, 2016.
 Dayan & Sejnowski (1996) Dayan, Peter and Sejnowski, Terrence J. Exploration bonuses and dual control. Machine Learning, 25:5–22, 1996.
 Ernst et al. (2005) Ernst, Damien, Geurts, Pierre, and Wehenkel, Louis. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
 Farahmand (2011) Farahmand, Amirmassoud. Regularization in reinforcement learning. PhD thesis, University of Alberta, 2011.
 Farahmand & Szepesvári (2011) Farahmand, Amirmassoud and Szepesvári, Csaba. Model selection in reinforcement learning. Machine Learning, 85:299–332, 2011.
 Farahmand et al. (2008a) Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized policy iteration. In Advances in Neural Information Processing Systems, pp. 441–448, 2008a.
 Farahmand et al. (2008b) Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized fitted Q-iteration: Application to planning. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5323 LNAI:55–68, 2008b.
 Farahmand et al. (2009) Farahmand, Amir Massoud, Ghavamzadeh, Mohammad, Szepesvári, Csaba, and Mannor, Shie. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. Proceedings of the American Control Conference, pp. 725–730, 2009.
 Hessel et al. (2018) Hessel, Matteo, Modayil, Joseph, van Hasselt, Hado, Schaul, Tom, Ostrovski, Georg, Dabney, Will, Horgan, Dan, Piot, Bilal, Azar, Mohammad Gheshlaghi, and Silver, David. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.
 Jiang et al. (2015) Jiang, Nan, Kulesza, Alex, Singh, Satinder, and Lewis, Richard. The dependence of effective planning horizon on model accuracy. In Proceedings of AAMAS, pp. 1181–1189, 2015.
 Johns et al. (2010) Johns, J, Painter-Wakefield, C, and Parr, R. Linear complementarity for regularized policy evaluation and improvement. Advances in Neural Information Processing Systems, 23:1009–1017, 2010.
 Kearns & Singh (1998) Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. In Proceedings of the 15th International Conference on Machine Learning, pp. 260–268, 1998. URL citeseer.nj.nec.com/kearns98nearoptimal.html.
 Kingma & Ba (2014) Kingma, Diederik P. and Ba, Jimmy Lei. Adam: A method for stochastic optimization. In arXiv preprint arXiv:1412.6980, 2014.

 Kolter & Ng (2009) Kolter, J. Zico and Ng, Andrew Y. Regularization and feature selection in least-squares temporal difference learning. Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09, 94305:1–8, 2009.
 Littman & Szepesvári (1996) Littman, Michael L. and Szepesvári, Csaba. A generalized reinforcement-learning model: Convergence and applications. In Saitta, Lorenza (ed.), Proceedings of the Thirteenth International Conference on Machine Learning, pp. 310–318, 1996.
 Liu et al. (2012) Liu, Bo, Mahadevan, Sridhar, and Liu, Ji. Regularized off-policy TD-learning. In Advances in Neural Information Processing Systems, pp. 836–844, 2012.
 Loth et al. (2007) Loth, Manuel, Davy, Manuel, and Preux, Philippe. Sparse temporal difference learning using lasso. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 352–359. IEEE, 2007.
 Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin A., Fidjeland, Andreas, Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
 Mnih et al. (2016) Mnih, Volodymyr, Badia, Adrià Puigdomènech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P., Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
 Ng et al. (2003) Ng, Andrew Y., Kim, H. Jin, Jordan, Michael I., and Sastry, Shankar. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems 16 (NIPS03), 2003.
 Petrik & Scherrer (2008) Petrik, Marek and Scherrer, Bruno. Biasing approximate dynamic programming with a lower discount factor. Advances in Neural Information Processing Systems (NIPS), 1:1–8, 2008.
 Petrik et al. (2010) Petrik, Marek, Taylor, Gavin, Parr, Ron, and Zilberstein, Shlomo. Feature selection using regularization in approximate linear programs for markov decision processes. arXiv preprint arXiv:1005.1860, 2010.
 Puterman (1994) Puterman, Martin L. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.
 Rummery (1994) Rummery, G. A. Problem solving with reinforcement learning. PhD thesis, Cambridge University Engineering Department, 1994.
 Schulman et al. (2017) Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov, Oleg. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
 Schweitzer & Seidmann (1985) Schweitzer, Paul J and Seidmann, Abraham. Generalized polynomial approximations in markovian decision processes. Journal of mathematical analysis and applications, 110(2):568–582, 1985.
 Somani et al. (2013) Somani, A, Ye, Nan, Hsu, D, and Lee, Ws. DESPOT : Online POMDP Planning with Regularization. Advances in Neural Information Processing Systems, pp. 1–9, 2013.
 Strehl et al. (2009) Strehl, Alexander L., Li, Lihong, and Littman, Michael L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.
 Sutton & Barto (1998) Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 1998.
 Sutton et al. (1999) Sutton, Richard S., Precup, Doina, and Singh, Satinder. Between MDPs and semiMDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999.
 Sutton et al. (2000) Sutton, Richard S., McAllester, David, Singh, Satinder, and Mansour, Yishay. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pp. 1057–1063, 2000.
 Taylor & Parr (2009) Taylor, Gavin and Parr, Ronald. Kernelized value function approximation for reinforcement learning. Proceedings of the 26th Annual International Conference on Machine Learning  ICML ’09, pp. 1–8, 2009.
 Thomas et al. (2015) Thomas, Philip, Theocharous, Georgios, and Ghavamzadeh, Mohammad. High Confidence Policy Improvement. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pp. 2380–2388, 2015.
 Williams (1992) Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.
 Williams & Baird (1993) Williams, Ronald J and Baird, Leemon C. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Citeseer, 1993.
 Xu et al. (2005) Xu, Xin, Xie, Tao, Hu, Dewen, and Lu, Xicheng. Kernel least-squares temporal difference learning. International Journal of Information Technology, 11(9):54–63, 2005.