Structural Return Maximization for Reinforcement Learning

05/12/2014 ∙ by Joshua Joseph, et al. ∙ MIT

Batch Reinforcement Learning (RL) algorithms attempt to choose a policy from a designer-provided class of policies given a fixed set of training data. Choosing the policy which maximizes an estimate of return often leads to over-fitting when only limited data is available, due to the size of the policy class in relation to the amount of data available. In this work, we focus on learning policy classes that are appropriately sized to the amount of data available. We accomplish this by using the principle of Structural Risk Minimization, from Statistical Learning Theory, which uses Rademacher complexity to identify a policy class that maximizes a bound on the return of the best policy in the chosen policy class, given the available data. Unlike similar batch RL approaches, our bound on return requires only extremely weak assumptions on the true system.




1 Introduction

Reinforcement Learning (RL) (Sutton & Barto, 1998) is a framework for sequential decision making under uncertainty with the objective of finding a policy that maximizes the sum of rewards, or return, of an agent. A straightforward model-based approach to batch RL, where the algorithm learns a policy from a fixed set of data, is to fit a dynamics model by minimizing a form of prediction error (e.g., minimum squared error) and then compute the optimal policy with respect to the learned model (Bertsekas, 2000). As discussed in Baxter & Bartlett (2001) and Joseph et al. (2013), learning a model for RL by minimizing prediction error can result in a policy that performs arbitrarily poorly for unfavorably chosen model classes. To overcome this limitation, a second approach is to not use a model and directly learn the policy from a policy class that explicitly maximizes an estimate of return (Meuleau et al., 2000).

With limited data, approaches that explicitly maximize estimated return are vulnerable to learning policies which perform poorly since the return cannot be confidently estimated. We overcome this problem by applying the principle of Structural Risk Minimization (SRM) (Vapnik, 1998), which, in terms of RL, states that instead of choosing the policy which maximizes the estimated return we should maximize a bound on return. In SRM the policy class size is treated as a controlling variable in the optimization of the bound, allowing us to naturally trade off between estimated performance and estimation confidence. By controlling policy class size in this principled way we can overcome the poor performance of approaches which explicitly maximize estimated return with small amounts of data.

The main contribution of this work is a batch RL algorithm which has bounded true return under extremely weak assumptions, unlike standard batch RL approaches. Our algorithm is the result of applying the principle of SRM to RL, which previously has only been studied in the context of classification and regression. We first show a bound on the return of a single policy from a fixed policy class based on a technique called Model-Free Monte Carlo (Fonteneau et al., 2012). We then map RL to classification, allowing us to transfer generalization bounds based on Rademacher complexity (Bartlett & Mendelson, 2003) which results in a bound on the return of any policy from a policy class. Given a structure of policy classes, we then apply the principle of SRM to find the highest performing policy from the family of policy classes.

Section 2 reviews Structural Risk Minimization in the context of classification. We move to RL in Section 3 and show the bound on the return of a policy. Section 4 ties together the previous two sections to provide a bound on return for a policy from a structure of policy classes and discusses some of the natural policy class structures that exist in RL. Section 5 first demonstrates our approach on a simple domain to build intuition for the reader and then validates its performance on increasingly difficult problems. Section 6 discusses related work and Section 7 concludes the paper.

2 Structural Risk Minimization for Classifier Learning

In this section we review Structural Risk Minimization for classification for completeness and to ensure that the parallels presented in Section 4 are clear for the reader. Classification is the problem of deciding on an output, $y \in \mathcal{Y}$, for a given input, $x \in \mathcal{X}$. The performance of a decision rule $f : \mathcal{X} \to \mathcal{Y}$ is measured using risk, $R(f)$, defined as

$$R(f) = \int L(y, f(x)) \, dP(x, y) \qquad (1)$$

where $L$ is the loss function and $P$ is the data generating distribution. For a class of decision rules, $\mathcal{F}$, the objective of classification is to select the decision rule which minimizes risk or, more formally,

$$f^* = \arg\min_{f \in \mathcal{F}} R(f) \qquad (2)$$

2.1 Empirical Risk Minimization

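Commonly, the distribution $P$ in Equation 1 is unknown, and we are therefore unable to solve Equation 2 using Equation 1. Given a dataset $D = \{(x_n, y_n)\}_{n=1}^{N}$ where each $(x_n, y_n)$ is drawn i.i.d. from $P$, Equation 1 can be approximated by empirical risk

$$R_{emp}(f; D) = \frac{1}{N} \sum_{n=1}^{N} L(y_n, f(x_n)) \qquad (3)$$

By using empirical risk as an estimate of risk we can attempt to solve Equation 2 using the principle of Empirical Risk Minimization (ERM) (Vapnik, 1998) where

$$\hat{f} = \arg\min_{f \in \mathcal{F}} R_{emp}(f; D) \qquad (4)$$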

In Section 3 we see that there is a clear analogy between this result and how a policy’s return is estimated in RL.
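As an illustration of Equations 3 and 4, the following minimal Python sketch computes empirical risk and performs ERM over a small finite class of threshold rules. The 0-1 loss, the threshold class, and the toy dataset are assumptions chosen for illustration; they are not taken from the paper.

```python
from typing import Callable, Sequence, Tuple

Rule = Callable[[float], int]
Dataset = Sequence[Tuple[float, int]]


def zero_one_loss(y: int, y_hat: int) -> float:
    """Assumed loss: 0 for a correct prediction, 1 otherwise."""
    return 0.0 if y == y_hat else 1.0


def empirical_risk(f: Rule, data: Dataset) -> float:
    """Average loss of f over the dataset (the estimator of Equation 3)."""
    return sum(zero_one_loss(y, f(x)) for x, y in data) / len(data)


def erm(rules: Sequence[Rule], data: Dataset) -> Rule:
    """Return the rule in the class with the smallest empirical risk (Equation 4)."""
    return min(rules, key=lambda f: empirical_risk(f, data))


def threshold_rule(t: float) -> Rule:
    """Hypothetical decision rule: predict 1 when the input is at least t."""
    return lambda x: 1 if x >= t else 0


# Toy data and a four-element class of decision rules (illustrative only).
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
rules = [threshold_rule(t) for t in (0.0, 0.25, 0.5, 0.75)]
best = erm(rules, data)
```

With only four samples, several thresholds between 0.4 and 0.6 would fit equally well, which is the kind of ambiguity that motivates bounding risk rather than trusting the empirical estimate alone.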

2.2 Bounding the Risk of a Classifier

We can bound risk (Equation 1) using empirical risk (Equation 3) with a straightforward application of Hoeffding’s inequality


which holds with probability $1 - \delta$. Since Equation 4 is used to choose $\hat{f}$, we need to ensure that $R(f)$ is bounded for all $f \in \mathcal{F}$ (not just for a single $f$ as Equation 5 guarantees). Bounds of this form ($\forall f \in \mathcal{F}$) can be written as

$$R(f) \le R_{emp}(f; D) + \Omega(L \circ \mathcal{F}, D, \delta) \qquad (6)$$

where $\Omega$ can be thought of as a complexity penalty on the size of $\mathcal{F}$ and the bound holds with probability $1 - \delta$.
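For a single, fixed decision rule, one standard form of the Hoeffding bound can be computed as below. This is a sketch under an assumption that the source does not state here: the loss takes values in an interval of width `loss_range`. The function names are hypothetical.

```python
import math


def hoeffding_penalty(n: int, delta: float, loss_range: float = 1.0) -> float:
    """One-sided Hoeffding deviation for the mean of n i.i.d. bounded losses.

    Assumes each loss lies in an interval of width `loss_range`; with
    probability at least 1 - delta the true mean exceeds the empirical
    mean by no more than the returned value.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return loss_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def risk_upper_bound(emp_risk: float, n: int, delta: float,
                     loss_range: float = 1.0) -> float:
    """Empirical risk plus the Hoeffding deviation, for one fixed rule."""
    return emp_risk + hoeffding_penalty(n, delta, loss_range)


# Quadrupling the data halves the deviation term.
small_sample = hoeffding_penalty(200, 0.05)
large_sample = hoeffding_penalty(800, 0.05)
```

The bound applies to one rule chosen before seeing the data, which is why a class-dependent penalty is needed once the rule is selected by ERM.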

Sections 2.2.1 and 2.2.2 describe specific forms of the complexity penalty $\Omega$ using Vapnik-Chervonenkis dimension and Rademacher complexity, which we chose due to their popularity in the literature, although many additional bounds are known, e.g., maximum discrepancy (Bartlett et al., 2002a), local Rademacher complexity (Bartlett et al., 2002b), and Gaussian complexity (Bartlett & Mendelson, 2003).

2.2.1 Vapnik-Chervonenkis Dimension

A well studied bound from Vapnik (1995) that takes the form of Equation 6 uses


where and is the Vapnik-Chervonenkis (VC) dimension of (see Vapnik (1998) for a thorough description of VC dimension).

2.2.2 Rademacher Complexity

In contrast to the well studied bound from Vapnik (1995) which depends on the Vapnik-Chervonenkis (VC) dimension of , Bartlett & Mendelson (2003) provide a bound based on the Rademacher complexity of , a quantity that can be straightforwardly estimated from the dataset. Their bound, which takes the form of Equation 6, uses


where , and

is a uniform random variable over

. , the Rademacher complexity of , can be estimated using


Bartlett & Mendelson (2003) also show that the error from estimating Rademacher complexity using the right hand side of Equation 9 is bounded with probability by


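To make the idea of estimating Rademacher complexity from data concrete, here is a minimal Monte Carlo sketch for a finite function class. The normalisation (a factor of 2/N and an absolute value inside the supremum) follows a common convention and is an assumption here; the function name and input layout are hypothetical.

```python
import random
from typing import Sequence


def empirical_rademacher(outputs: Sequence[Sequence[float]],
                         num_draws: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of empirical Rademacher complexity.

    outputs[j][i] is the value of the j-th function in the class on the
    i-th data point. Each draw samples signs uniformly from {-1, +1} and
    records the largest absolute correlation achieved by any function.
    """
    rng = random.Random(seed)
    n = len(outputs[0])
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        best = max(abs(sum(s * v for s, v in zip(sigma, row))) for row in outputs)
        total += 2.0 * best / n
    return total / num_draws


# A richer class can correlate with random signs at least as well.
small_class = [[1.0, 0.0, 1.0, 0.0]]
large_class = small_class + [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
small_value = empirical_rademacher(small_class)
large_value = empirical_rademacher(large_class)
```

Because both calls use the same seed and the same number of data points, they see identical sign draws, so the estimate for the larger class can never be smaller than the estimate for the class it contains.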
2.3 Structural Risk Minimization

As discussed in Vapnik (1995), the principle of ERM is only intended to be used with a large amount of data (relative to the size of $\mathcal{F}$). With a small data set, a small value of $R_{emp}(f; D)$ does not guarantee that $R(f)$ will be small and therefore solving Equation 4 says little about the generalization of $\hat{f}$. The principle of Structural Risk Minimization (SRM) states that since we cannot guarantee the generalization of ERM under limited data we should explicitly minimize the bound on generalization (Equation 6) by using a structure of function classes.

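A structure of function classes is defined as a collection of nested subsets of functions $S_1 \subseteq S_2 \subseteq \cdots \subseteq S_k \subseteq \cdots$ where $S_k = L \circ \mathcal{F}_k$. For example, a structure of radial basis functions can be created by placing increasing limits on the magnitude of the basis functions. SRM then treats the capacity of $L \circ \mathcal{F}_k$ as a controlling variable and minimizes Equation 6 for each $S_k$ such that

$$\hat{k} = \arg\min_{k} \left[ R_{emp}(\hat{f}_k; D) + \Omega(L \circ \mathcal{F}_k, D, \delta) \right], \qquad \hat{f}_k = \arg\min_{f \in \mathcal{F}_k} R_{emp}(f; D) \qquad (11)$$

To solve Equation 11 we must solve both equations jointly. One can imagine enumerating $k$, finding $\hat{f}_k$ for each $k$, and choosing the corresponding $\mathcal{F}_k$ which minimizes $R_{emp}(\hat{f}_k; D) + \Omega(L \circ \mathcal{F}_k, D, \delta)$.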
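The enumeration described above can be sketched as a short loop. This is an illustrative outline that assumes each class in the structure is a finite list, and that the caller supplies the empirical risk and a per-class penalty; `srm_select` and its arguments are hypothetical names.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

F = TypeVar("F")


def srm_select(structure: Sequence[Sequence[F]],
               emp_risk: Callable[[F], float],
               penalty: Callable[[int], float]) -> Tuple[int, F, float]:
    """Pick the class index and function minimizing empirical risk plus penalty.

    structure[k] is the k-th (nested) function class, emp_risk scores one
    function on the data, and penalty(k) is the complexity term of class k.
    """
    best: Optional[Tuple[int, F, float]] = None
    for k, function_class in enumerate(structure):
        f_k = min(function_class, key=emp_risk)  # ERM inside class k
        bound = emp_risk(f_k) + penalty(k)       # value of the bound for class k
        if best is None or bound < best[2]:
            best = (k, f_k, bound)
    if best is None:
        raise ValueError("structure must contain at least one class")
    return best


# Toy structure: three nested classes with made-up empirical risks.
risks = {"a": 0.40, "b": 0.30, "c": 0.28}
structure = [["a"], ["a", "b"], ["a", "b", "c"]]
k_hat, f_hat, bound = srm_select(structure, risks.get, lambda k: 0.05 * (k + 1))
```

In this toy run the largest class has the lowest empirical risk, but its larger penalty makes the middle class the one with the smallest bound.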

3 Bounding the Return of a Policy

Section 2 allowed us to bound classification performance given small amounts of data; we now turn our attention to bounding policy performance. A finite time Markov Decision Process (MDP) is defined as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{W}, m, \rho, s_0, T)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{W}$ is the disturbance space [1], $m$ is the dynamics model, $\rho$ is the reward function, $s_0$ is the starting state, and $T$ is the maximum length of an episode [2]. For a policy, $\pi$, we define its return [3] as


We call the sequence an episode of data where and .

[1] The disturbance space is introduced so we may assume the dynamics model is deterministic and add noise through $w \in \mathcal{W}$. This is primarily done to facilitate theoretical analysis and is equivalent to the standard RL notation which uses a stochastic dynamics model that does not include $w$.
[2] For simplicity we assume $\rho$ and $s_0$ are known and deterministic and $\rho$ is only a function of the current state. Without loss of generality this allows us to write as a scalar everywhere. This work can be straightforwardly extended to an unknown and stochastic and and the more general .
[3] $V(\pi)$ is shorthand for $V^{\pi}(s_0)$, the expected sum of rewards from following policy $\pi$ from state $s_0$.

Throughout this work we assume that the return is bounded by and that the dynamics model, reward function, and policy are Lipschitz continuous with constants , and , respectively (Fonteneau et al., 2012), where

The objective of an MDP is to find the policy


from a given policy class, $\Pi$. Typically in RL the dynamics model, $m$, is unavailable to us and therefore we are unable to use Equation 13 to solve Equation 14.

3.1 Estimating the Return of a Policy

To overcome not knowing the dynamics model $m$ in Equation 13 we commonly use data of interactions with $m$ to estimate the return $V(\pi)$, a step called policy evaluation, and then solve Equation 14. A difficulty in estimating $V(\pi)$ lies in how the data was collected, or, more precisely, which policy was used to generate the data. We discuss two different types of policy evaluation: on-policy, where the data used to estimate $V(\pi)$ is generated using $\pi$, and off-policy, where the data is collected using any policy.

3.1.1 On-Policy Policy Evaluation

Given a set of episodes of data collected using , we can estimate using empirical return (analogous to Equation 3) where


, , , and Equation 16 holds with probability by Hoeffding’s inequality (Hoeffding, 1963). This approach, where episodes are generated on-policy and then the return is estimated using Equation 15, is called Monte Carlo policy evaluation (Sutton & Barto, 1998). Since we do not make any assumptions about the policies that will be evaluated nor the policy under which the data was generated, we cannot directly use Equations 15 and 16 but we will build upon them in the following sections.
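Monte Carlo policy evaluation can be sketched as averaging the returns of simulated on-policy episodes. The toy dynamics, policy, and reward below are hypothetical and serve only to make the example runnable; the disturbance is passed explicitly to a deterministic step function, mirroring the disturbance-space convention described earlier. Summing the rewards of the states reached after each step is also an assumption about indexing.

```python
import random
from typing import Callable

Policy = Callable[[float], float]
Step = Callable[[float, float, float], float]
Reward = Callable[[float], float]


def rollout_return(policy: Policy, step: Step, reward: Reward, s0: float,
                   horizon: int, noise_scale: float, rng: random.Random) -> float:
    """Sum of rewards along one simulated on-policy episode."""
    state, total = s0, 0.0
    for _ in range(horizon):
        action = policy(state)
        disturbance = rng.uniform(-noise_scale, noise_scale)
        state = step(state, action, disturbance)
        total += reward(state)
    return total


def monte_carlo_return(policy: Policy, step: Step, reward: Reward, s0: float,
                       horizon: int, num_episodes: int,
                       noise_scale: float = 0.1, seed: int = 0) -> float:
    """Empirical return: the mean of the episode returns."""
    rng = random.Random(seed)
    returns = [rollout_return(policy, step, reward, s0, horizon, noise_scale, rng)
               for _ in range(num_episodes)]
    return sum(returns) / num_episodes


# Hypothetical 1D system used only for illustration.
def toy_step(s: float, a: float, w: float) -> float:
    return s + a + w


def toy_policy(s: float) -> float:
    return -0.5 * s


def toy_reward(s: float) -> float:
    return -abs(s)


estimate = monte_carlo_return(toy_policy, toy_step, toy_reward,
                              s0=1.0, horizon=3, num_episodes=50)
```

Every episode here is generated by the policy being evaluated, which is exactly the requirement that off-policy methods relax.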

3.1.2 Off-Policy Policy Evaluation

Naively, using Equation 15 to approximate $V(\pi)$ and solve Equation 14 would require data for each $\pi \in \Pi$, which is an infinite amount of data for infinite policy classes. Off-policy policy evaluation aims to alleviate this issue by estimating $V(\pi)$ using episodes collected under a policy that may be different from $\pi$. To perform off-policy evaluation, we use an approach called Model-Free Monte Carlo-like policy evaluation (MFMC) (Fonteneau et al., 2012), which attempts to approximate the estimator from Equation 15 by piecing together artificial episodes of on-policy data from off-policy, batch data.

Consider a set of data , which we re-index as one-step transitions

To evaluate a policy, , MFMC uses a distance function

and pieces together artificial episodes from such that is an artificial on-policy episode approximating an episode for . To construct , MFMC starts with and for we find


where and once a transition, , is chosen using Equation 17 it is removed from . Following the construction of episodes , MFMC estimates the return of using


We can bound the return using Theorem 4.1, Lemma A.1, and Lemma A.2 of Fonteneau et al. (2010) and say that



for each chosen using Equation 17, and

The term is the maximum deviation between the true return of any policy and the expected MFMC estimate of return of that policy. See Fonteneau et al. (2010, 2012) for a discussion regarding the choice of and .
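The MFMC procedure described in this section can be sketched as follows for scalar states and actions. This is a simplified illustration: the distance is assumed to be the sum of absolute differences in state and action, each transition stores its observed reward, and `mfmc_estimate` is a hypothetical name rather than an API from Fonteneau et al.

```python
from typing import Callable, List, Sequence, Tuple

# (state, action, reward, next_state) for one observed step
Transition = Tuple[float, float, float, float]


def mfmc_estimate(transitions: Sequence[Transition],
                  policy: Callable[[float], float],
                  s0: float, horizon: int, num_episodes: int) -> float:
    """Average return of artificial episodes stitched from batch transitions."""
    if num_episodes <= 0 or horizon <= 0:
        raise ValueError("horizon and num_episodes must be positive")
    pool: List[Transition] = list(transitions)
    if len(pool) < horizon * num_episodes:
        raise ValueError("need at least horizon * num_episodes transitions")
    returns = []
    for _ in range(num_episodes):
        state, total = s0, 0.0
        for _ in range(horizon):
            action = policy(state)
            # Nearest stored transition to the (state, action) the policy wants.
            nearest = min(range(len(pool)),
                          key=lambda i: abs(pool[i][0] - state) + abs(pool[i][1] - action))
            # Each transition is used at most once, so it leaves the pool.
            _, _, reward, next_state = pool.pop(nearest)
            total += reward
            state = next_state
        returns.append(total)
    return sum(returns) / num_episodes


# Toy batch: the policy always picks action 1.0, so the first two
# transitions are matched exactly and the others are never used.
batch = [(0.0, 1.0, 1.0, 1.0), (1.0, 1.0, 2.0, 2.0),
         (0.0, -1.0, -5.0, -1.0), (5.0, 5.0, 100.0, 5.0)]
value = mfmc_estimate(batch, lambda s: 1.0, s0=0.0, horizon=2, num_episodes=1)
```

Removing each matched transition from the pool keeps the artificial episodes from sharing data, at the cost of needing at least horizon times num_episodes transitions.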

3.1.3 Probabilistic Bound of the MFMC Estimator

Unfortunately, the bound provided in Equation 19 only allows us to bound the return using the expectation of the MFMC estimate, not the realized estimate based on the data, which is required in Section 4. We present such a bound, beginning with Hoeffding’s inequality (Hoeffding, 1963) [4],

[4] Note that MFMC only allows a single transition to be used once in constructing a set of episodes. Therefore, the returns from the episodes do not violate the independence assumption required to use Hoeffding’s inequality.


where we move from Equation 20 to Equation 21 using Equation 19. Setting , solving for , and substituting the quantity into Equation 21 we see that with at least probability


While Equation 22 is useful for bounding the return estimate using MFMC, in Section 4 we will require a bound between and . By combining Equations 16 and 22, we have with probability


4 Structural Return Maximization for Policy Learning

Sections 2 and 3 provide intuition for the similarities between classification and RL, e.g., in classification we choose a decision rule $f$ using Equations 1 and 2, and in RL we choose a policy $\pi$ using Equations 13 and 14. In this section we aim to formalize the relationship between classification and RL, and by doing so we are able to use Structural Risk Minimization (Section 2.3) to learn policies drawn from policy classes that are appropriately sized given the available data.

4.1 Mapping Classification to Reinforcement Learning

The difficulty in mapping classification to RL is most easily seen when we consider the bounds in Sections 2.2.1 and 2.2.2, for which we need to know the RL equivalent of to either compute its VC dimension (for Equation 7) or its Rademacher complexity (for Equation 8). To show the mapping we begin by defining the return function,


and show that the classification objective of minimizing risk is equivalent to the RL objective of maximizing return. Using Equation 2 and a loss function that does not depend on $y$ [5], we set the loss to the negative of the return function and see that

[5] While it may seem that writing $L$ without $y$ is an abuse of notation, if instead we view $L$ as a measure of performance the relationship between RL and classification becomes clearer.


where we go from Equation 25 to Equation 26 by setting and noting that for , and we move from Equation 26 to Equation 27 using Equation 13. Therefore, minimizing risk over the decision rules in $\mathcal{F}$ is identical to maximizing return over the policies in $\Pi$. We see that the return function is the term in brackets in Equation 12; it encodes both the reward function and dynamics model and is analogous to classification’s loss function $L$. This is a crucial relationship since transferring the bounds from Section 2.2 depends on our being able to calculate the RL equivalent of $L \circ \mathcal{F}$, which we denote $\mathcal{R} \circ \Pi$ [6].

[6] Note that it is difficult to tell how deep the connection between classification and RL is and, while we think this will prove an interesting line of future research, the purpose of this connection is solely to provide us with the machinery necessary for the bounds.

4.2 Bound on Return for all Policies in a Policy Class

Using the mapping from the previous section we can rewrite Equations 3 and 6 for RL as


where . Note that in Equation 28 is needed to compute the empirical return which typically is not observed in practice. Since we assume the state is fully observable, we can use Equation 28 to calculate empirical return. Combining Equations 23 and 29,


which holds with probability and is the bound on return for all policies in $\Pi$. Section 4.4 describes the use of Rademacher complexity (Section 2.2.2) to compute $\Omega$ in Equation 30.

4.3 Bound Based on VC Dimension for RL

To use the bound described in Section 2.2.1 we need to know the VC dimension of the RL equivalent of $L \circ \mathcal{F}$. Unfortunately, the VC dimension is only known for specific function classes (e.g., linear indicator functions (Vapnik, 1998)), and, since the only assumption we made about the functional form of the dynamics model, reward function, and policy is that they are Lipschitz continuous, this class will not in general have known VC dimension.

There also exist known bounds on the VC dimension for, e.g., neural networks (Anthony & Bartlett, 1999), decision trees (Aslan et al., 2009), support vector machines (Vapnik, 1998), and smoothly parameterized function classes (Lee et al., 1995), but, again, due to our relatively few assumptions on the dynamics model, reward function, and policy, the RL equivalent of $L \circ \mathcal{F}$ is not a function class with a known bound on the VC dimension. Cherkassky & Mulier (1998) note that for some classification problems, the VC dimension of $\mathcal{F}$ can be used as a reasonable approximation for the VC dimension of $L \circ \mathcal{F}$, but we have no evidence to support this being an accurate approximation when used in RL.

Other work has been done to estimate the VC dimension from data (Shao et al., 1969; Vapnik et al., 1994) and bound the VC dimension estimate (McDonald et al., 2011). While, in principle, we are able to estimate the VC dimension using one of these techniques, the approach described in Section 4.4 is a far simpler method for computing $\Omega$ in Equation 29 based on data.

4.4 Bound Based on Rademacher Complexity for RL

Using the Rademacher complexity bound (Section 2.2.2) allows us to calculate $\Omega$ (Equation 29) based on data. The only remaining piece is how to calculate the summation inside the absolute value sign of Equation 9 for RL. Mapping the Rademacher complexity estimator (Equation 9) into RL yields


Therefore, with probability


where we move from Equation 31 to Equation 32 using


from Equation 23 and we assumed for simplicity.

4.5 Structures of Policy Classes

In Section 2.3 we defined a structure of function classes, from which a function class and function from that class are chosen using Equation 11. To use structural risk minimization for RL, we must similarly define a structure,


of policy classes [7]. For RL we add the additional constraint that the policies must be Lipschitz continuous (Section 3) in order to use the bound provided in Section 3.1.2. Note that the structure (e.g., the indexing order of the classes) must be specified before any data is seen.

[7] Note that the structure is technically over but .

Fortunately, some function classes have a “natural” ordering which may be taken advantage of; for example, support vector machines (Vapnik, 1998) use decreasing margin size as an ordering of the structure. In RL, many common policy representations contain natural structure and are also Lipschitz continuous. Consider a policy class consisting of the linear combinations of radial basis functions (Menache et al., 2005). This class is Lipschitz continuous and, using this representation, we may impose a structure by progressively increasing a limit on the magnitude of all basis functions, so that a higher limit allows for a greater range of actions a policy may choose. This policy class consists of policies of the form


where , are fixed beforehand and the progressively increasing limit and is the index of the policy class structure.
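One way to picture such a structure is a radial basis function policy whose weights are capped by a class-specific limit. The sketch below is an assumption-laden illustration: Gaussian basis functions, a shared width, a scalar state, and a cap applied to each weight are choices made here and may differ from the exact form in Equation 35.

```python
import math
from typing import Callable, List, Sequence


def rbf_features(state: float, centers: Sequence[float], width: float) -> List[float]:
    """Gaussian radial basis features with fixed centers and a shared width."""
    return [math.exp(-((state - c) ** 2) / (2.0 * width ** 2)) for c in centers]


def make_rbf_policy(weights: Sequence[float], centers: Sequence[float],
                    width: float) -> Callable[[float], float]:
    """Policy that outputs a weighted sum of the basis features."""
    def policy(state: float) -> float:
        features = rbf_features(state, centers, width)
        return sum(w * phi for w, phi in zip(weights, features))
    return policy


def project_to_class(weights: Sequence[float], limit: float) -> List[float]:
    """Clip every weight into [-limit, limit], the assumed class constraint."""
    return [max(-limit, min(limit, w)) for w in weights]


def in_class(weights: Sequence[float], limit: float) -> bool:
    """Membership test for the class indexed by this limit."""
    return all(abs(w) <= limit for w in weights)


# Increasing limits give nested classes: a policy that fits under a small
# cap also fits under every larger cap.
limits = [0.5, 1.0, 2.0]
centers = [-1.0, 0.0, 1.0]
weights = project_to_class([3.0, -0.2, -3.0], limits[1])
policy = make_rbf_policy(weights, centers, width=0.5)
```

The centers and width are fixed before any data is seen, so the only quantity that changes across the structure is the cap.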

A second policy representation which meets our requirements consists of policies of the form


where are fixed beforehand and a policy of this representation is described by set . Using this representation, we then present two possible structures, the second of which we use for the experiments in Section 5. The first places a cap on , where and . The second structure “ties” together such that for , . For , we untie but still maintain that and continue for the remaining such that Equation 34 is maintained.

Even though the representations described in this section are natural structures that can be used for many RL policy classes, the performance of each will strongly depend on the problem. In general, choosing a policy class for an RL problem is often difficult and therefore it often must be left up to a designer. We leave further investigation into automatically constructing structures or making the structure data-dependent (e.g., Shawe-Taylor et al. (1996)) to future work.

4.6 Finding the Best Policy from a Structure of Policy Classes

Using a structure of policy classes as described in the previous section, we may now reformulate the objective of SRM into Structural Return Maximization for RL as


where $\Omega$ is computed using Equations 8, 10, and 32. Therefore, with a small batch of data, Equations 37 and 38 will choose a small policy class, and as we acquire more data the chosen policy class naturally grows. To solve Equation 38 we follow Joseph et al. (2013) and use standard gradient descent with random restarts.
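A gradient-based search with random restarts over a black-box objective can be sketched as below. This is a generic illustration, not the authors' implementation: the finite-difference gradient, the step size, and the projection onto a box of half-width `limit` (standing in for a policy class constraint) are all assumptions, and the quadratic objective is a stand-in for an estimated return.

```python
import random
from typing import Callable, List, Sequence, Tuple

Objective = Callable[[Sequence[float]], float]


def finite_difference_gradient(objective: Objective, theta: Sequence[float],
                               eps: float = 1e-4) -> List[float]:
    """Central-difference estimate of the gradient of a black-box objective."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((objective(up) - objective(down)) / (2.0 * eps))
    return grad


def ascend_with_restarts(objective: Objective, dim: int, limit: float,
                         restarts: int = 5, steps: int = 200,
                         lr: float = 0.1, seed: int = 0) -> Tuple[List[float], float]:
    """Projected gradient ascent from several random starts; keep the best."""
    rng = random.Random(seed)
    best_theta: List[float] = []
    best_value = float("-inf")
    for _ in range(restarts):
        theta = [rng.uniform(-limit, limit) for _ in range(dim)]
        for _ in range(steps):
            grad = finite_difference_gradient(objective, theta)
            # Step uphill, then clip back into the box [-limit, limit]^dim.
            theta = [max(-limit, min(limit, t + lr * g)) for t, g in zip(theta, grad)]
        value = objective(theta)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value


def toy_objective(theta: Sequence[float]) -> float:
    """Stand-in for an estimated return, maximized at (2, -1)."""
    return -((theta[0] - 2.0) ** 2) - ((theta[1] + 1.0) ** 2)


wide_theta, wide_value = ascend_with_restarts(toy_objective, dim=2, limit=5.0)
narrow_theta, narrow_value = ascend_with_restarts(toy_objective, dim=2, limit=1.0)
```

The narrow class cannot reach the unconstrained maximizer, so its best objective value is lower; in the structural setting that loss is weighed against the smaller complexity penalty of the narrow class.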

5 Empirical Results

Joseph et al. (2013) demonstrated the utility of a maximum return (MR) learner on a variety of simulated domains and a real-world problem with extremely complex dynamics. As the results from Joseph et al. (2013) show, MR performs poorly with little data and in this section we empirically demonstrate how Structural Return Maximization (SRM) remedies this shortcoming in a variety of domains.

The evaluation of SRM is done in comparison to MR due to MR being the only algorithm, to the best of our knowledge, in the RL literature which has identical assumptions to SRM. For each experiment, the chosen approaches [8] choose policies from a designer-provided policy class. To use SRM we imposed a structure on the policy class (where the original policy class was the largest, most expressive class in the structure) and the methodology from Section 4.6 allowed SRM to select an appropriately sized policy class and policy from that class. The MR learner maximizes the empirical return (Equation 15) using the single, largest policy class.

[8] See Section 1 for our discussion on the pitfalls of using model-based approaches in this setting.

We compare these approaches on three simulated problems: a 1D toy domain, the inverted pendulum domain, and an intruder monitoring domain. The domains demonstrate how SRM naturally grows the policy class as more data is seen and is far less vulnerable to over-fitting with small amounts of data than an MR learner. Policy classes consisted of linear combinations of radial basis functions (Section 4.5) and training data was collected from a random policy. To piece together artificial trajectories for MFMC we used where is the number of data episodes.

Figure 1: (a, b) 1D toy domain: performance versus the amount of training data and class size versus the amount of training data. (c) Inverted pendulum: performance versus the amount of training data. (d) Intruder monitoring: performance versus the amount of training data. Error bars represent the 95% confidence interval of the mean.

5.1 1D Toy Domain

The purpose of the 1D toy domain is to enable understanding for the reader. The domain is a single dimensional world consisting of an agent who begins at and attempts to “stabilize” at in the presence of noise. The dynamics are , and is a uniform random variable over . The agent takes actions and has reward function . For the policy representation we used four evenly spaced radial basis functions and for SRM we imposed five limits on (see Equation 35).

Figures 1(a) and 1(b) show the performance of SRM (red solid line) and MR (blue solid line) on the 1D toy domain, where Figure 1(a) is zoomed in to highlight SRM’s performance with small amounts of data. The plot shows that MR over-fits the small amounts of data, resulting in poor performance. On the other hand, SRM overcomes the problem of over-fitting by selecting a policy class which is appropriately sized for the amount of data. Figure 1(b) illustrates how SRM (red dashed line) selects a larger policy class as more data is seen, in contrast to MR, which uses a fixed policy class (blue dashed line). The figures show that as SRM is given more data, it selects increasingly larger classes, allowing it to learn higher performing policies without over-fitting.

5.2 Inverted Pendulum

The inverted pendulum is a standard RL benchmark problem (see Lagoudakis & Parr (2003b) for a detailed explanation and parameterization of the system). In our experiments we started the pendulum upright with the objective of learning policies which stabilize the pendulum in the presence of noise. For the policy representation we placed 16 evenly spaced radial basis functions and for SRM we imposed two limits on (Equation 35), .

Figure 1(c) shows the performance of SRM (red) and MR (blue) on inverted pendulum. Similar to the results on the 1D toy domain, we see that MR over-fits the training data early on, resulting in poor performance. In contrast, SRM achieves higher performance early on by using a small policy class with small amounts of data and growing the policy class as more data is seen.

5.3 Intruder Monitoring

The intruder monitoring domain models the scenario of an intruder traversing a two-dimensional world where a camera must monitor the intruder around a sensitive location. The camera observes a circle of radius centered at , and the intruder, located at , wanders toward the sensitive location with additive uniform noise. The camera dynamics follow , where the agent takes action and has reward where . For our policy representation, we placed 16 radial basis points on a grid inside and for SRM we imposed two limits on , .

Figure 1(d) shows the performance of SRM (red) and MR (blue) on the intruder monitoring domain. Similar to the results on the previous domains, we see that SRM outperforms MR due to using a small policy class with small amounts of data and growing the policy class as more data is seen.

6 Related Work

While there has been a significant amount of prior work relating Reinforcement Learning (RL) and classification (Langford & Zadrozny, 2003; Lagoudakis & Parr, 2003a; Barto & Dietterich, 2004; Langford, 2005), to the best of our knowledge, our work is the first to sufficiently develop the mapping to allow the analysis presented in Section 2 to be transferred to RL. There has also been work to use classifiers to represent policies in RL (Bagnell et al., 2003; Rexakis & Lagoudakis, 2008; Dimitrakakis & Lagoudakis, 2008; Blatt & Hero, 2006), which is tangential to our work; our focus is on using the principle of Structural Risk Minimization for RL. Additional work uses classification theory to bound performance for on-policy data (Lazaric et al., 2010; Farahmand et al., 2012), for which Section 3.1.3 can be seen as extending to batch, off-policy data.

A second class of approaches aims to prevent over-fitting with small amounts of data by using both frequentist and Bayesian forms of regularization (Strens, 2000; Abbeel & Ng, 2005; Bartlett & Tewari, 2009; Doshi-Velez et al., 2010). These methods either lack formal guarantees similar to Equation 29 for batch data or require strong assumptions about the form of the true dynamics model, the true value function, or the optimal policy.

The RL literature also has a great deal of work on growing representations as more data is seen. Past work in the model-based (Doshi-Velez, 2009; Joseph et al., 2011) and value-based (Ratitch & Precup, 2004; Whiteson et al.; Geramifard et al., 2011) settings has proven successful but generally requires either prior distributions or a large amount of training data collected under a specific policy. Additionally, our approach applies in model-based, value-based, and policy search settings by treating either the model class or value function class as indirect policy representations.

7 Conclusion

In this work we applied Structural Risk Minimization to Reinforcement Learning (RL) to allow us to learn appropriately sized policy classes for a given amount of batch data. The resulting algorithm has provable performance bounds under extremely weak assumptions. To accomplish this we presented a mapping of classification to RL which allowed us to transfer the theoretical bounds previously developed in the context of classification. These bounds allowed us to learn the policy from a structured policy class which maximized the bound on return. We demonstrated the benefit of our approach on the 1D toy, inverted pendulum, and intruder monitoring domains as compared to an agent which naively maximizes the empirical return of the single, large policy class.

References


  • Abbeel & Ng (2005) Abbeel, Pieter and Ng, Andrew Y. Exploration and apprenticeship learning in reinforcement learning. In ICML, 2005.
  • Anthony & Bartlett (1999) Anthony, Martin and Bartlett, Peter L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
  • Aslan et al. (2009) Aslan, Ozlem, Yildiz, Olcay Taner, and Alpaydin, Ethem. Calculating the VC-dimension of decision trees. In ISCIS 2009, 14-16 September 2009, North Cyprus, pp. 193–198. IEEE, 2009.
  • Bagnell et al. (2003) Bagnell, J. Andrew, Kakade, Sham, Ng, Andrew, and Schneider, Jeff. Policy search by dynamic programming. In NIPS, 2003.
  • Bartlett & Mendelson (2003) Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 2003.
  • Bartlett & Tewari (2009) Bartlett, Peter L and Tewari, Ambuj. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In UAI, 2009.
  • Bartlett et al. (2002a) Bartlett, Peter L., Boucheron, Stéphane, and Lugosi, Gábor. Model selection and error estimation. Machine Learning, 48(1-3):85–113, 2002a.
  • Bartlett et al. (2002b) Bartlett, Peter L., Bousquet, Olivier, and Mendelson, Shahar. Local rademacher complexities. In Annals of Statistics, pp. 44–58, 2002b.
  • Barto & Dietterich (2004) Barto, A.G. and Dietterich, T.G. Reinforcement learning and its relationship to supervised learning. Handbook of learning and approximate dynamic programming, 2004.
  • Baxter & Bartlett (2001) Baxter, Jonathan and Bartlett, Peter L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 2001.
  • Bertsekas (2000) Bertsekas, Dimitri P. Dynamic Programming and Optimal Control. Athena Scientific, 2000.
  • Blatt & Hero (2006) Blatt, Doron and Hero, Alfred. From weighted classification to policy search. In Weiss, Y., Schölkopf, B., and Platt, J. (eds.), NIPS. 2006.
  • Cherkassky & Mulier (1998) Cherkassky, Vladimir S. and Mulier, Filip. Learning from Data: Concepts, Theory, and Methods. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1998.
  • Dimitrakakis & Lagoudakis (2008) Dimitrakakis, Christos and Lagoudakis, Michail G. Rollout Sampling Approximate Policy Iteration. Machine Learning, 72(3):157–171, September 2008.
  • Doshi-Velez (2009) Doshi-Velez, Finale. The infinite partially observable markov decision process. In NIPS, 2009.
  • Doshi-Velez et al. (2010) Doshi-Velez, Finale, Wingate, David, Roy, Nicholas, and Tenenbaum, Joshua. Nonparametric bayesian policy priors for reinforcement learning. 2010.
  • Farahmand et al. (2012) Farahmand, Amir-Massoud, Precup, Doina, and Ghavamzadeh, Mohammad. Generalized classification-based approximate policy iteration. In European Workshop on Reinforcement Learning, 2012.
  • Fonteneau et al. (2010) Fonteneau, Raphael, Murphy, Susan A., Wehenkel, Louis, and Ernst, Damien. Model-free monte carlo-like policy evaluation. Journal of Machine Learning Research - Proceedings Track, 2010.
  • Fonteneau et al. (2012) Fonteneau, Raphael, Murphy, Susan A., Wehenkel, Louis, and Ernst, Damien. Batch mode reinforcement learning based on the synthesis of artificial trajectories. 2012.
  • Geramifard et al. (2011) Geramifard, Alborz, Doshi, Finale, Redding, Joshua, Roy, Nicholas, and How, Jonathan. Online discovery of feature dependencies. In ICML, 2011.
  • Hoeffding (1963) Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, March 1963.
  • Joseph et al. (2011) Joseph, Joshua, Doshi-Velez, Finale, Huang, Albert S., and Roy, Nicholas. A Bayesian Nonparametric Approach to Modeling Motion Patterns. Autonomous Robots, 31(4):383–400, 2011.
  • Joseph et al. (2013) Joseph, Joshua, Geramifard, Alborz, Roberts, John W., How, Jonathan P., and Roy, Nicholas. Reinforcement learning with misspecified model classes. In ICRA, 2013.
  • Lagoudakis & Parr (2003a) Lagoudakis, Michail G. and Parr, Ronald. Reinforcement learning as classification: Leveraging modern classifiers. In ICML, 2003a.
  • Lagoudakis & Parr (2003b) Lagoudakis, Michail G. and Parr, Ronald. Least-squares policy iteration. JMLR, 4:1107–1149, 2003b.
  • Langford (2005) Langford, John. Relating reinforcement learning performance to classification performance. In ICML, 2005.
  • Langford & Zadrozny (2003) Langford, John and Zadrozny, Bianca. Reducing t-step reinforcement learning to classification, 2003.
  • Lazaric et al. (2010) Lazaric, A., Ghavamzadeh, M., Munos, R., et al. Analysis of a classification-based policy iteration algorithm. 2010.
  • Lee et al. (1995) Lee, Wee Sun, Bartlett, Peter, and Williamson, Robert. Lower bounds on the vc-dimension of smoothly parametrized function classes. Neural Computation, 1995.
  • McDonald et al. (2011) McDonald, Daniel, Shalizi, Cosma, and Schervish, Mark. Estimated vc dimension for risk bounds. 2011.
  • Menache et al. (2005) Menache, Ishai, Mannor, Shie, and Shimkin, Nahum. Basis function adaptation in temporal difference reinforcement learning. Annals OR, 134(1):215–238, 2005.
  • Meuleau et al. (2000) Meuleau, Nicolas, Peshkin, Leonid, Kaelbling, Leslie, and Kim, Kee. Off-policy policy search. Technical report, MIT Artificial Intelligence Laboratory, 2000.
  • Ratitch & Precup (2004) Ratitch, Bohdana and Precup, Doina. Sparse distributed memories for on-line value-based reinforcement learning. In Machine Learning: ECML, Lecture Notes in Computer Science, 2004.
  • Rexakis & Lagoudakis (2008) Rexakis, Ioannis and Lagoudakis, Michail G. Classifier-based policy representation. In ICMLA, 2008.
  • Shao et al. (1969) Shao, Xuhui, Cherkassky, Vladimir, and Li, William. Measuring the vc-dimension using optimized experimental design. Neural Computation, 12:2000, 1969.
  • Shawe-Taylor et al. (1996) Shawe-Taylor, John, Holloway, Royal, Bartlett, Peter L., Williamson, Robert C., and Anthony, Martin. Structural risk minimization over data-dependent hierarchies, 1996.
  • Strens (2000) Strens, Malcolm. A bayesian framework for reinforcement learning. In ICML, pp. 943–950, 2000.
  • Sutton & Barto (1998) Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, May 1998.
  • Vapnik (1998) Vapnik, Vladimir. Statistical learning theory. Wiley, 1998. ISBN 978-0-471-03003-4.
  • Vapnik et al. (1994) Vapnik, Vladimir, Levin, Esther, and LeCun, Yann. Measuring the vc-dimension of a learning machine. Neural Computation, 6(5):851–876, 1994.
  • Vapnik (1995) Vapnik, Vladimir N. The nature of statistical learning theory. Springer-Verlag New York, Inc, 1995.
  • Whiteson et al. Whiteson, Shimon, Taylor, Matthew, and Stone, Peter. Adaptive tile coding for value function approximation. Technical report, University of Texas at Austin.