Supervised and reinforcement learning are two well-known learning paradigms that have been researched mostly independently. Recent studies have investigated using mature supervised learning methods for reinforcement learning [9, 6, 10, 7]. Initial results have shown that policies can be approximately represented using multi-class classifiers, and it is therefore possible to incorporate classification algorithms within the inner loops of several reinforcement learning algorithms [9, 6, 7]. This viewpoint allows the performance of reinforcement learning algorithms to be quantified in terms of the performance of classification algorithms [10]. While a variety of promising combinations become possible through this synergy, so far there have been few practical results and few widely applicable algorithms.
Herein we consider approximate policy iteration algorithms, such as those proposed by Lagoudakis and Parr [9] as well as Fern et al. [6, 7], which do not explicitly represent a value function. At each iteration, a new policy/classifier is produced using training data obtained through extensive simulation (rollouts) of the previous policy on a generative model of the process. These rollouts aim at identifying better action choices over a subset of states in order to form a set of data for training the classifier representing the improved policy. The major limitation of these algorithms, as also indicated by Lagoudakis and Parr [9], is the large amount of rollout sampling employed at each sampled state. It is hinted, however, that great improvement could be achieved with sophisticated management of sampling. We have verified this intuition in a companion paper [4] that experimentally compared the original approach of uninformed uniform sampling with various intelligent sampling techniques. That paper employed heuristic variants of well-known algorithms for bandit problems, such as Upper Confidence Bounds [1] and Successive Elimination [5], for the purpose of managing rollouts (choosing which state to sample from is similar to choosing which lever to pull on a bandit machine). It should be noted, however, that despite the similarity, rollout management differs substantially from standard bandit problems, and thus general bandit results are not directly applicable to our case.
The current paper aims to offer a first theoretical insight into the rollout sampling problem. This is done through the analysis of the two simplest sample allocation methods described in [4]. The first is the older method that simply allocates an equal, fixed number of samples to each state. The second is the slightly more sophisticated method of progressively sampling all states for which we are not yet reasonably certain which action is the policy-improving one.
A Markov Decision Process (MDP) is a 6-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma, D)$, where $\mathcal{S}$ is the state space of the process, $\mathcal{A}$ is a finite set of actions, $P$ is a Markovian transition model ($P(s, a, s')$ denotes the probability of a transition to state $s'$ when taking action $a$ in state $s$), $R$ is a reward function ($R(s, a)$ is the expected reward for taking action $a$ in state $s$), $\gamma \in (0, 1]$ is the discount factor for future rewards, and $D$ is the initial state distribution. A deterministic policy $\pi$ for an MDP is a mapping $\pi : \mathcal{S} \to \mathcal{A}$ from states to actions; $\pi(s)$ denotes the action choice at state $s$. The value $V^\pi(s)$ of a state $s$ under a policy $\pi$ is the expected, total, discounted reward when the process begins in state $s$ and all decisions at all steps are made according to $\pi$:
$$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R\big(s_t, \pi(s_t)\big) \,\Big|\, s_0 = s\right].$$
The goal of the decision maker is to find an optimal policy $\pi^*$ that maximises the expected, total, discounted reward from all states; in other words, $V^{\pi^*}(s) \ge V^\pi(s)$ for all policies $\pi$ and all states $s \in \mathcal{S}$.
Policy iteration (PI) is an efficient method for deriving an optimal policy. It generates a sequence $\pi_1$, $\pi_2$, …, $\pi_k$ of gradually improving policies, which terminates when there is no change in the policy ($\pi_k = \pi_{k-1}$); $\pi_k$ is an optimal policy. Improvement is achieved by computing $V^{\pi_i}$ analytically (solving the linear Bellman equations) and the action values
$$Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s')\, V^{\pi_i}(s'),$$
and then determining the improved policy as $\pi_{i+1}(s) = \arg\max_{a} Q^{\pi_i}(s, a)$.
Policy iteration typically terminates in a small number of steps. However, it relies on knowledge of the full MDP model, exact computation and representation of the value function of each policy, and exact representation of each policy. Approximate policy iteration (API) is a family of methods that have been suggested to address the “curse of dimensionality”, that is, the huge growth in complexity as the problem grows. In API, value functions and policies are represented approximately in some compact form, but the iterative improvement process remains the same. Consequently, the guarantees for monotonic improvement, optimality, and convergence are compromised. API may never converge; in practice, however, it reaches good policies in only a few iterations.
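As a point of reference for the approximate variants, exact policy iteration on a small finite MDP can be sketched as follows. This is a minimal illustration, not code from the paper: the nested-list layout of `P` and `R`, the function names, and the use of iterative policy evaluation in place of a direct linear solve are all assumptions made to keep the example dependency-free, and it presumes a discount factor strictly below 1.

```python
def evaluate_policy(P, R, gamma, policy, tol=1e-12):
    """Approximate V^pi by fixed-point iteration.

    Assumption: this stands in for solving the linear Bellman equations exactly.
    P[s][a][s2] is a transition probability, R[s][a] an expected reward.
    """
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        new_V = []
        for s in range(n_states):
            a = policy[s]
            expected_next = sum(p * V[s2] for s2, p in enumerate(P[s][a]))
            new_V.append(R[s][a] + gamma * expected_next)
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V


def policy_iteration(P, R, gamma):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = len(P), len(P[0])
    policy = [0] * n_states
    while True:
        V = evaluate_policy(P, R, gamma, policy)
        improved = []
        for s in range(n_states):
            # Action values Q(s, a) = R(s, a) + gamma * sum_s' P(s, a, s') V(s').
            q = [
                R[s][a] + gamma * sum(p * V[s2] for s2, p in enumerate(P[s][a]))
                for a in range(n_actions)
            ]
            # Ties are broken in favour of the lowest action index.
            improved.append(max(range(n_actions), key=lambda a: q[a]))
        if improved == policy:
            return policy, V
        policy = improved


# Hypothetical two-state example: state 1 pays reward 1 forever,
# and action 1 in state 0 moves there.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 0.0], [1.0, 1.0]]
policy, V = policy_iteration(P, R, gamma=0.9)
```

In this toy problem the loop stabilises after two improvement steps, choosing action 1 in state 0.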
2.1 Rollout estimates
Typically, API employs some representation of the MDP model to compute the value function and derive the improved policy. On the other hand, the Monte-Carlo estimation technique of rollouts provides a way of accurately estimating $Q^\pi$ at any given state-action pair $(s, a)$ without requiring an explicit MDP model or representation of the value function. Instead, a generative model of the process (a simulator) is used; such a model takes a state-action pair $(s, a)$ and returns a reward $r$ and a next state $s'$ sampled from $R(s, a)$ and $P(s, a, s')$ respectively.
A rollout for the state-action pair $(s, a)$ amounts to simulating a single trajectory of the process beginning from state $s$, choosing action $a$ for the first step, and choosing actions according to the policy $\pi$ thereafter up to a certain horizon $T$. If we denote the sequence of collected rewards during the $i$-th simulated trajectory as $r_t^{(i)}$, $t = 0, 1, \ldots, T-1$, then the rollout estimate $\hat{Q}^{\pi,T}_K(s, a)$ of the true state-action value function $Q^\pi(s, a)$ is the observed total discounted reward, averaged over all $K$ trajectories:
$$\hat{Q}^{\pi,T}_K(s, a) = \frac{1}{K} \sum_{i=1}^{K} \sum_{t=0}^{T-1} \gamma^t r_t^{(i)}.$$
Similarly, we define $Q^{\pi,T}(s, a)$ to be the actual state-action value function up to horizon $T$. As will be seen later, with a sufficient number of rollouts and a long horizon $T$, we can create an improved policy $\pi'$ from $\pi$ at any state $s$, without requiring a model of the MDP.
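The rollout estimate described above can be sketched in a few lines. The simulator interface used here (a callable taking a state, an action and a random number generator, and returning a reward and a next state) is an assumption for illustration, as are all function and parameter names.

```python
import random


def rollout_estimate(simulator, policy, state, action, gamma, horizon, n_rollouts, rng):
    """Average truncated discounted return over n_rollouts simulated trajectories.

    simulator(s, a, rng) -> (reward, next_state) is a hypothetical generative model.
    policy(s) -> action is the policy followed after the first step.
    """
    total = 0.0
    for _ in range(n_rollouts):
        s, a = state, action          # the first action is fixed by the caller
        ret, discount = 0.0, 1.0
        for _ in range(horizon):
            reward, s = simulator(s, a, rng)
            ret += discount * reward
            discount *= gamma
            a = policy(s)             # subsequent actions follow the policy
        total += ret
    return total / n_rollouts


# Hypothetical deterministic simulator: every step pays reward 1.
def constant_simulator(s, a, rng):
    return 1.0, s


q_hat = rollout_estimate(constant_simulator, lambda s: 0, state=0, action=1,
                         gamma=0.5, horizon=3, n_rollouts=4, rng=random.Random(0))
```

With a constant reward of 1, a discount of 0.5 and a horizon of 3, each trajectory returns 1 + 0.5 + 0.25, so the estimate is 1.75.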
3 Related work
Rollout estimates have been used in the Rollout Classification Policy Iteration (RCPI) algorithm, which has yielded promising results in several learning domains. However, as stated therein, it is sensitive to the distribution of training states over the state space. For this reason it is suggested to draw states from the discounted future state distribution of the improved policy. This tricky-to-sample distribution, also used by Fern et al., yields better results. One explanation advanced in those studies is the reduction of the potential mismatch between the training and testing distributions of the classifier.
However, in both cases, and irrespective of the sampling distribution, the main drawback is the excessive computational cost due to the need for lengthy and repeated rollouts to reach a good level of accuracy in the estimation of the value function. In our preliminary experiments with RCPI, we observed that most of the effort is spent where the action value differences are either non-existent, or so fine that a prohibitive number of rollouts is required to identify them. In this paper, we propose and analyse sampling methods to remove this performance bottleneck. By restricting the sampling distribution to the case of a uniform grid, we compare the fixed allocation algorithm (Fixed) [9, 7], whereby a large fixed number of rollouts is used for estimating the action values in each training state, to a simple incremental sampling scheme based on counting (Count), where the number of rollouts in each training state varies. We then derive complexity bounds, which show a clear improvement using Count that depends only on the structure of differential value functions.
We note that Fern et al. presented a related analysis. While they go into considerably more depth with respect to the classifier, their results are not applicable to our framework. This is because they assume that there exists some real number which lower-bounds the amount by which the value of the optimal action(s) under any policy exceeds the value of the nearest sub-optimal action in any state. Furthermore, the algorithm they analyse uses a fixed number of rollouts at each sampled state. For a given minimum value over all states, they derive the necessary number of rollouts per state to guarantee an improvement step with high probability, but the algorithm offers no practical way to guarantee a high-probability improvement. We instead derive error bounds for the fixed and counting allocation algorithms. Additionally, we consider continuous, rather than discrete, state spaces. Because of this, our analysis is technically much more closely related to that of Auer et al. [2].
4 Algorithms to reduce sampling cost
The total sampling cost depends on the balance between the number of states sampled and the number of samples per state. In the fixed allocation scheme [9, 7], the same number of rollouts is allocated to each state in a subset of states, and all rollouts dedicated to a single action are exhausted before moving on to the next action. Intuitively, if the desired outcome (superiority of some action) in some state can be confidently determined early, there is no need to exhaust all rollouts available in that state; the training data could be stored and the state could be removed from the pool without further examination. Similarly, if we can confidently determine that all actions are indifferent in some state, we can simply reject it without wasting any more rollouts; such rejected states could be replaced by fresh ones which might yield meaningful results. These ideas lead to the following question: can we examine all states in the subset collectively, in some interleaved manner, by selecting each time a single state to focus on and allocating rollouts only as needed?
Selecting states from the state pool could be viewed as a problem akin to a multi-armed bandit problem, where each state corresponds to an arm. Pulling a lever corresponds to sampling the corresponding state once. By sampling a state we mean that we perform a single rollout for each action in that state, as shown in Algorithm 1. This is the minimum amount of information we can request from a single state. (It is also possible to manage the sampling of the actions, but herein we are only concerned with the effort saved by managing state sampling.) Thus, the problem is transformed into a variant of the classic multi-armed bandit problem. Several methods have been proposed for various versions of this problem, which could potentially be used in this context. In this paper, apart from the fixed allocation scheme presented above, we also examine a simple counting scheme.
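The algorithms presented here maintain an empirical estimate of the marginal difference between the apparently maximal and the second-best actions. This can be represented by the marginal difference in $Q^\pi$ values in state $s$, defined as
$$\Delta^\pi(s) = Q^\pi(s, a^*_{s,\pi}) - \max_{a \neq a^*_{s,\pi}} Q^\pi(s, a),$$
where $a^*_{s,\pi}$ is the action that maximises $Q^\pi$ in state $s$:
$$a^*_{s,\pi} = \arg\max_{a \in \mathcal{A}} Q^\pi(s, a).$$
The case of multiple equivalent maximising actions can be easily handled by generalising to sets of actions in the manner of Fern et al., in particular
$$A^*_{s,\pi} = \{a \in \mathcal{A} : Q^\pi(s, a) \ge Q^\pi(s, a') \ \forall a' \in \mathcal{A}\}, \qquad \Delta^\pi(s) = \max_{a \in \mathcal{A}} Q^\pi(s, a) - \max_{a \notin A^*_{s,\pi}} Q^\pi(s, a),$$
with $\Delta^\pi(s) = 0$ when every action is maximising.
However, here we discuss only the single best action case to simplify the exposition. The estimate $\hat{\Delta}^\pi(s)$ is defined using the empirical value function $\hat{Q}^\pi(s, a)$.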
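The empirical margin for the single-best-action case can be computed directly from a table of estimated action values. The sketch below is illustrative only; the dictionary representation and the function name are assumptions.

```python
def empirical_margin(q_values):
    """Return (apparently best action, gap to the second-best action).

    q_values maps each action to its empirical value estimate in one state.
    Requires at least two actions.
    """
    if len(q_values) < 2:
        raise ValueError("need at least two actions to define a margin")
    best = max(q_values, key=q_values.get)
    runner_up = max(v for a, v in q_values.items() if a != best)
    return best, q_values[best] - runner_up


# Hypothetical estimates for three actions in one state.
best, margin = empirical_margin({"left": 0.2, "right": 0.7, "stay": 0.5})
```

Here the apparently best action is `"right"`, with a margin of roughly 0.2 over `"stay"`.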
5 Complexity of sampling-based policy improvement
Rollout algorithms can be used for policy improvement under certain conditions. Bertsekas [3] gives several theorems for policy iteration using rollouts and an approximate value function that satisfies a consistency property. Specifically, Proposition 3.1 therein states that the one-step look-ahead policy computed from the approximate value function , has a value function which is better than the current approximation , if for all . It is easy to see that an approximate value function that uses only sampled trajectories from a fixed policy satisfies this property if we have an adequate number of samples. While this assures us that we can perform rollouts at any state in order to improve upon the given policy, it does not lend itself directly to policy iteration. That is, with no way to compactly represent the resulting rollout policy, we would be limited to performing deeper and deeper tree searches in rollouts.
In this section we shall give conditions that allow policy iteration through compact representation of rollout policies via a grid and a finite number of sampled states and sample trajectories with a finite horizon. Following this, we will analyse the complexity of the fixed sampling allocation scheme employed in [9, 7] and compare it with an oracle that needs only one sample to determine for any and a simple counting scheme.
5.1 Sufficient conditions
Assumption 1 (Bounded finite-dimension state space)
The state space is a compact subset of .
This assumption can be generalised to other bounded state spaces easily. However, it is necessary to have this assumption in order to be able to place some minimal constraints on the search.
Assumption 2 (Bounded rewards)
for all , .
This assumption bounds the reward function and can also be generalised easily to other bounding intervals.
Assumption 3 (Hölder Continuity)
For any policy , there exists , such that for all states
This assumption ensures that the value function is fairly smooth. It trivially follows in conjunction with Assumptions 1 and 2 that are bounded everywhere in if they are bounded for at least one . Furthermore, the following holds:
Given that, by definition, for all , it follows from Assumption 3 that
for all such that .
This remark implies that the best action in some state according to will also be the best action in a neighbourhood of states around . This is a reasonable condition as there would be no chance of obtaining a reasonable estimate of the best action in any region from a single point, if could change arbitrarily fast. We assert that MDPs with a similar smoothness property on their transition distribution will also satisfy this assumption.
Finally, we need an assumption that limits the total number of rollouts that we need to take, as states with a smaller will need more rollouts.
Assumption 4 (Measure)
If denotes the Lebesgue measure of set , then, for any , there exist such that for all .
This assumption effectively limits the number of times value-function changes lead to best-action changes, as well as the ratio of states where the action values are close. This assumption, together with the Hölder continuity assumption, imposes a certain structure on the space of value functions. We are thus guaranteed that the value function of any policy results in an improved policy which is not arbitrarily complex. This, in turn, implies that an optimal policy cannot be arbitrarily complex either.
A final difficulty is determining whether there exists some sufficient horizon beyond which it is unnecessary to go. Unfortunately, even though for any state for which , there exists such that for all , grows without bound as we approach a point where the best action changes. However, by selecting a fixed, sufficiently large rollout horizon, we can still behave optimally with respect to the true value function in a compact subset of .
For any policy , , there exists a finite and a compact subset such that
where is such that for all .
From the above assumptions it follows directly that for any , there exists a compact set of states such that for all , with . Now let . Then, . For any the limit exists and thus by definition such that for all . Since is compact, also exists. (Footnote: for a discount factor we can simply bound with .) ∎
This ensures that we can identify the best action within , using a finite rollout horizon, in most of . Moreover, from Assumption 4.
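For the discounted case, the footnote to the proof notes that the horizon can be bounded explicitly. A minimal sketch of one such bound follows, under the assumption (made here for illustration) that rewards lie in $[0, 1]$: the reward discarded by truncating at horizon $T$ is then at most $\gamma^T / (1 - \gamma)$, which is at most $\epsilon$ once $T \ge \log[\epsilon (1 - \gamma)] / \log \gamma$. The function name is hypothetical.

```python
import math


def sufficient_horizon(epsilon, gamma):
    """Smallest integer T with gamma**T / (1 - gamma) <= epsilon.

    Assumes rewards in [0, 1] and a discount factor strictly between 0 and 1.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if epsilon * (1.0 - gamma) >= 1.0:
        return 0  # even the untruncated return is within epsilon of zero
    return math.ceil(math.log(epsilon * (1.0 - gamma)) / math.log(gamma))


horizon = sufficient_horizon(epsilon=0.01, gamma=0.9)
```

For $\gamma = 0.9$ and $\epsilon = 0.01$ this gives a horizon of 66 steps; the bound grows without limit as $\gamma$ approaches 1, which is why the undiscounted case needs separate treatment.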
In standard policy iteration, the improved policy over has the property that the improved action in any state is the action with the highest value in that state. However, in rollout-based policy iteration, we may only guarantee being within of the maximally improved policy.
Definition 5.1 (-improved policy)
An -improved policy derived from satisfies
Such a policy will be said to be improving in if for all . The measure of states for which there cannot be improvement is limited by Assumption 4. Finding an improved for the whole of is in fact not possible in finite time, since this requires determining the boundaries in at which the best action changes. (Footnote: to see this, consider , with some and . Finding requires a binary search, at best.)
In all cases, we shall attempt to find the improving action at each state on a uniform grid of states, with the next policy taking the estimated best action for the state closest to , i.e. it is a nearest-neighbour classifier.
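The grid-plus-nearest-neighbour policy representation can be sketched as follows. The unit hypercube as state space, the placement of grid points at cell centres, and all names are assumptions made for illustration; distances use the infinity norm, in keeping with the spheres used in the analysis.

```python
import itertools


def uniform_grid(points_per_dim, dim):
    """Cell centres of a uniform grid over the unit hypercube [0, 1]^dim (assumed state space)."""
    centres = [(i + 0.5) / points_per_dim for i in range(points_per_dim)]
    return list(itertools.product(centres, repeat=dim))


def nearest_neighbour_policy(grid_actions):
    """Build a policy that copies the action of the closest grid state.

    grid_actions maps each grid state (a tuple) to its estimated best action.
    """
    def policy(state):
        nearest = min(
            grid_actions,
            key=lambda g: max(abs(x - y) for x, y in zip(g, state)),
        )
        return grid_actions[nearest]
    return policy


# Hypothetical one-dimensional example with two grid states.
grid = uniform_grid(points_per_dim=2, dim=1)          # [(0.25,), (0.75,)]
policy = nearest_neighbour_policy({grid[0]: "a", grid[1]: "b"})
```

The number of grid states grows as `points_per_dim ** dim`, which is the source of the dimension dependence in the bounds that follow.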
In the remainder, we derive complexity bounds for achieving an -improved policy from with probability at least . We shall always assume that we are using a sufficiently deep rollout to cover and only consider the number of rollouts performed. First, we shall derive the number of states we need to sample from in order to guarantee an -improved policy, under the assumption that at each state we have an oracle which can give us the exact values for each state we examine. Later, we shall consider sample complexity bounds for the case where we do not have an oracle, but use empirical estimates at each state.
5.2 The Oracle algorithm
Let denote the infinity-norm sphere of radius centred in and consider Alg. 2 (Oracle) that can instantly obtain the state-action value function for any point in . The algorithm creates a uniform grid of states, such that the distance between adjacent states is – and so can cover with spheres . Due to Assumption 3, the error in the action values of any state in sphere of state s will be bounded by . Thus, the resulting policy will be -improved, i.e. this will be the maximum regret it will suffer over the maximally improved policy.
To bound this regret by , it is sufficient to have states in the grid. The following proposition follows directly.
Algorithm 2 results in regret for .
Furthermore, as for all such that , will be the improved action in all of , then will be improving in with . Both the regret and the lack of complete coverage are due to the fact that we cannot estimate the best-action boundaries with arbitrary precision in finite time. When using rollout sampling, however, even if we restrict ourselves to improvement, we may still make an error due to both the limited number of rollouts and the finite horizon of the trajectories. In the remainder, we shall derive error bounds for two practical algorithms that employ a fixed grid with a finite number of -horizon rollouts.
5.3 Error bounds for states
When we estimate the value function at each using rollouts, there is a probability that the estimated best action is not in fact the best action. For any given state under consideration, we can apply the following well-known lemma to obtain a bound on this error probability:
Lemma 5.2 (Hoeffding inequality)
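Let $X$ be a random variable in $[b, b + Z]$ with mean $\bar{X} = E[X]$, independent observed values $x_1, \ldots, x_n$ of $X$, and empirical mean $\hat{X}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$. Then $P(\hat{X}_n \ge \bar{X} + \epsilon) \le \exp(-2 n \epsilon^2 / Z^2)$ and $P(\hat{X}_n \le \bar{X} - \epsilon) \le \exp(-2 n \epsilon^2 / Z^2)$, for any $\epsilon > 0$.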
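To give a feel for the magnitudes involved, Hoeffding's inequality for a variable whose range has width $Z$ bounds each one-sided deviation probability by $\exp(-2 n \epsilon^2 / Z^2)$, and this can be inverted to obtain a sufficient sample size. The helper names below are hypothetical.

```python
import math


def hoeffding_bound(n, epsilon, width=1.0):
    """One-sided Hoeffding bound on P(empirical mean deviates by at least epsilon)."""
    return math.exp(-2.0 * n * epsilon ** 2 / width ** 2)


def samples_needed(epsilon, delta, width=1.0):
    """Smallest n for which the one-sided Hoeffding bound is at most delta."""
    return math.ceil(width ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


n = samples_needed(epsilon=0.1, delta=0.05)
```

For a unit-width range, a deviation of 0.1 and a failure probability of 0.05 this gives 150 samples. Halving the deviation quadruples the requirement, which is why states with small action-value differences dominate the sampling cost.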
Without loss of generality, consider two random variables , with empirical means and empirical difference . Their means and difference will be denoted as respectively.
Note that if , and then necessarily , so . The converse is
Now, consider such that for all . Setting and , where is a normalising constant such that , we can apply (3). Note that the bound is largest for the action with value closest to , for which it holds that . Using this fact and an application of the union bound, we conclude that for any state , from which we have taken samples, it holds that:
5.4 Uniform sampling: the Fixed algorithm
As we have seen in the previous section, if we employ a grid of states, covering with spheres , where , and taking action in each sphere centred in , then the resulting policy is only guaranteed to be improved within of the optimal improvement from , where . Now, we examine the case where, instead of obtaining the true , we have an estimate arising from samples from each action in each state, for a total of samples. Algorithm 3 accepts (i.e. it sets to be the empirically highest value action in that state) for all states satisfying:
The condition ensures that the probability that , meaning the optimally improving action is not , at any state is at most . This can easily be seen by substituting the right hand side of (5) for in (4). As , this results in an error probability of a single state smaller than and we can use a union bound to obtain an error probability of for each policy improvement step.
For each state that the algorithm considers, the following two cases are of interest: (a) , meaning that even when we have correctly identified , we are still not improving over all of and (b) .
While the probability of accepting the wrong action is always bounded by , we must also calculate the probability that we fail to accept an action at all, when to estimate the expected regret. Restating our acceptance condition as , this is given by:
Is ? Note that for , if then so is . So, in order to achieve total probability for all state-action pairs in this case, after some calculations, we arrive at this expression for the regret
By equating the two sides, we get an expression for the minimum number of samples necessary per state:
This directly allows us to state the following proposition.
The sample complexity of Algorithm 3 to achieve regret at most with probability at least is .
5.5 The Count algorithm
The Count algorithm starts with a policy and a set of states , with . At each iteration , each state in is sampled once. Once a state contains a dominating action, it is removed from the search. So,
Thus, the number of samples from each state is if .
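The control flow of Count can be sketched as below. This is not the paper's Algorithm 4: the acceptance threshold is an assumed Hoeffding-style rule (a union bound over states and actions, ignoring the union over rounds), and the `sample_return` callable, which stands for one rollout return for a state-action pair, is hypothetical.

```python
import math


def count_allocation(states, actions, sample_return, delta, max_rounds, value_range=1.0):
    """Sample every remaining state once per round; drop a state once one action dominates.

    Returns (accepted state -> action, total number of rollouts used).
    Assumption: a state is accepted when its empirical margin reaches
    value_range * sqrt(2 * log(|S| * |A| / delta) / n) after n rounds.
    """
    sums = {s: {a: 0.0 for a in actions} for s in states}
    active = list(states)
    accepted = {}
    rollouts_used = 0
    log_term = math.log(len(states) * len(actions) / delta)
    for n in range(1, max_rounds + 1):
        if not active:
            break
        threshold = value_range * math.sqrt(2.0 * log_term / n)
        still_active = []
        for s in active:
            for a in actions:                 # one rollout per action in this state
                sums[s][a] += sample_return(s, a)
                rollouts_used += 1
            means = {a: sums[s][a] / n for a in actions}
            best = max(actions, key=means.get)
            runner_up = max(means[a] for a in actions if a != best)
            if means[best] - runner_up >= threshold:
                accepted[s] = best            # dominating action found: stop sampling s
            else:
                still_active.append(s)
        active = still_active
    return accepted, rollouts_used


# Hypothetical deterministic returns: state 0 has a clear best action, state 1 has none.
returns = {(0, "a"): 1.0, (0, "b"): 0.0, (1, "a"): 0.5, (1, "b"): 0.5}
accepted, used = count_allocation([0, 1], ["a", "b"], lambda s, a: returns[(s, a)],
                                  delta=0.1, max_rounds=20)
```

In this toy run state 0 is accepted after 8 rounds while state 1 consumes the full budget, for 56 rollouts in total; a fixed allocation of 20 rollouts per action in each state would use 80.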
We can apply similar arguments to analyse Count, by noting that the algorithm spends less time in states with higher values. The measure assumption then allows us to calculate the number of states with large and thus, the number of samples that are needed.
We have already established that there is an upper bound on the regret depending on the grid resolution . We proceed by forming subsets of states . Note that we only need to consider .
Similarly to the previous algorithm, and due to our acceptance condition, for each state , we need in order to bound the total error probability by . The total number of samples necessary is
A bound on is required to bound this expression. Note that
It follows that and consequently
The above results directly in the following proposition:
The sample complexity of Algorithm 4 to achieve regret at most with probability at least , is .
We note that we are of course not able to remove the dependency on , which is only due to the use of a grid. Nevertheless, we obtain a reduction in sample complexity of order for this very simple algorithm.
We have derived performance bounds for approximate policy improvement without a value function in continuous MDPs. We compared the usual approach of sampling equally from a set of candidate states to the slightly more sophisticated method of sampling from all candidate states in parallel, and removing a candidate state from the set as soon as it was clear which action is best. For the second algorithm, we find an improvement of approximately . Our results complement those of Fern et al. [7] for relational Markov decision processes. However, a significant amount of future work remains.
Firstly, we have assumed everywhere that . While this may be a relatively mild assumption for , it is problematic for the undiscounted case, as some states would require far deeper rollouts than others to achieve regret . Thus, in future work we would like to examine sample complexity in terms of the depth of rollouts as well.
Secondly, we would like to extend the algorithms to increase the number of states that we look at: whenever for all , then we could increase the resolution. For example, if
then we could increase the resolution around those states with the smallest . This would get around the problem of having to select .
A related point that has not been addressed herein is the choice of policy representation. The grid-based representation probably makes poor use of the available number of states. For the increased-resolution scheme outlined above, a classifier such as $k$-nearest-neighbour could be employed. Furthermore, regularised classifiers might have a smoothing effect on the resulting policy, and allow the learning of improved policies from a set of states containing erroneous best-action choices.
As far as the state allocation algorithms are concerned, in a companion paper [4] we have compared the performance of Count and Fixed with additional allocation schemes inspired by the UCB and successive elimination algorithms. We have found that all methods outperform Fixed in practice, sometimes by an order of magnitude, with the UCB variants being the best overall.
For this reason, in future work we plan to perform an analysis of such algorithms. A further extension to deeper searches, for example by managing the sampling of actions within a state, could also be performed using techniques similar to [8].
Thanks to the reviewers and to Adam Atkinson, Brendan Barnwell, Frans Oliehoek, Ronald Ortner and D. Jacob Wildstrom for comments and useful discussions.
- [1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 47(2-3):235–256, 2002.
- [2] P. Auer, R. Ortner, and C. Szepesvari. Improved Rates for the Stochastic Continuum-Armed Bandit Problem. Proceedings of the Conference on Computational Learning Theory (COLT), 2007.
- [3] Dimitri Bertsekas. Dynamic programming and suboptimal control: From ADP to MPC. Fundamental Issues in Control, European Journal of Control, 11(4-5), 2005. From 2005 CDC, Seville, Spain.
- [4] Christos Dimitrakakis and Michail Lagoudakis. Rollout sampling approximate policy iteration. Machine Learning, 72(3), September 2008.
- [5] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006.
- [6] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias. Advances in Neural Information Processing Systems, 16(3), 2004.
- [7] A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias: Solving relational Markov decision processes. Journal of Artificial Intelligence Research, 25:75–118, 2006.
- [8] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of ECML-2006, 2006.
- [9] Michail G. Lagoudakis and Ronald Parr. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML), pages 424–431, Washington, DC, USA, August 2003.
- [10] John Langford and Bianca Zadrozny. Relating reinforcement learning performance to classification performance. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 473–480, Bonn, Germany, 2005.