1 Introduction
Research in reinforcement learning focuses on designing algorithms for an agent interacting with an environment to adjust its behavior to optimize a long-term return. For environments which are fully observable (i.e. the observations the agent makes contain all of the necessary information about the state of the environment), this problem can often be solved using a one-step look-ahead analysis to formulate the solution as a dynamic programming problem. However, for the case of partially observable domains (i.e. the observations are stochastic or incomplete representations of the environment’s state), the perceptual aliasing of the observations makes such methods infeasible.
One viable approach is to search directly in a parameterized space of policies for a local optimum. Following Williams’s reinforce algorithm (Williams92), searching by gradient descent has been considered for a variety of policy classes (Marbach; Baird99; MeuleauUAI99; SuttonNIPS99; BaxterICML00). A commonly recognized shortcoming of all these variations on gradient descent policy search is that they require a very large number of samples (instances of agent-environment interaction) to converge.
This inefficiency arises because the value of the policy (or its derivative) is estimated by sampling from the returns obtained by following that same policy. Thus, after one policy is evaluated and a new one proposed, the samples taken from the old policy must be discarded. Each new step of the policy search algorithm requires a new set of samples. The key to solving this inefficiency is to use data gathered when using one policy to estimate the value of another policy. The method known as “likelihood ratio” estimation enables this data reuse.
Stochastic gradient methods and likelihood ratios have long been used for optimization problems (see the work of Glynn86, Glynn89, Glynn90, Glynn94). Recently, stochastic gradient descent methods, in particular reinforce (Williams92; Williams91), have been used in conjunction with policy classes constrained in various ways: with external memory (Peshkin99), with finite state controllers (MeuleauUAI99), and in multi-agent settings (Peshkin00). The idea of using likelihood ratios in reinforcement learning was suggested by Csaba and developed for solving mdps with function approximation by PrecupICML00 and for gradient descent in finite state controllers by MeuleauTR00. However, only online optimization was considered. Shelton01 and SheltonPhD developed a greedy algorithm for combining samples from multiple policies in normalized estimators and demonstrated a dramatic improvement in performance. Peshkin01 showed that likelihood-ratio estimation enables the application of methods from statistical learning theory to derive pac bounds on sample complexity.
KearnsNIPS99 provide a method for estimating the return of every policy simultaneously using data gathered while executing a fixed policy, without the use of likelihood ratios. In some domains, there is a natural distance between observations and actions which also allows one to reuse experience without likelihood ratio estimation. Glynn00 demonstrate algorithms for kernel-based rl in one such domain: financial planning and investments.
This paper extends our previous work by presenting a generalized method of using likelihood ratio estimation in policy search and investigating the performance of this method under different conditions on illustrative examples. With this publication we hope to stimulate a dialog between the communities of reinforcement learning and computational learning theory. We present a clear outline of all algorithms in the hope of attracting a wider research community to applying these algorithms in various domains. We also present some new bounds on the sample complexity of these algorithms and attempt to relate these results to empirical results. We begin this paper with a brief definition of reinforcement learning and sampling in order to clarify our notation. Then we present our algorithm and consider the question of how to sample. Finally we consider the question of how much to sample and present a pac-style bound as a quantitative answer.
2 Background
We introduce the environment model and importance sampling in a single mathematical notation. In particular, we keep the standard notation for partially observable Markov decision processes and modify the sampling notation to be consistent.
2.1 Environment Model
The class of problems we consider can be described by the partially observable Markov decision process (pomdp) model. In a pomdp, a sequence of events occurs at each time step: the agent receives an observation $o$ that depends on the state $s$ of the environment; it performs an action $a$ according to its policy, inducing a state transition of the environment; then it receives a reward $r$ based on the action taken and the environment’s state. A pomdp is defined by four probability distributions (and the spaces over which those distributions are defined): a distribution over starting states, a distribution over observations conditioned on the state, a distribution over next states conditioned on the current state and the agent’s action, and a distribution over rewards given the state and action. These distributions, which specify the dynamics of the environment, are unknown to the agent, as is the state space of the process, $S$.
Let $H$ denote the set of all possible experience sequences of length $T$. Generally speaking, in a pomdp, a policy $\pi$ is a function specifying the action to perform at each time step as a function of the whole previous history: $\pi : H \to \mathcal{P}(A)$, a map from histories to distributions over the action set $A$. This function is parameterized by a vector $\theta \in \Theta$. The policy class $\Pi$ is the set of policies realizable by all parameter settings. We assume that the probability of the elementary event is bounded away from zero: $0 < c \le \Pr(a \mid h, \theta)$, for any $a \in A$, $h \in H$, and $\theta \in \Theta$. A history $h$ includes several immediate rewards that are typically summed to form a return, $R(h)$, but our results are independent of the method used to compute the return.
Together with the distributions defined by the pomdp, any policy $\pi_\theta$ defines a conditional distribution $\Pr(h \mid \theta)$ on the class of all histories $H$. The value of policy $\pi_\theta$ is the expected return according to the probability induced by this policy on the history space: $V(\theta) = E_\theta[R(h)] = \sum_{h \in H} R(h) \Pr(h \mid \theta)$, where $\Pr(h \mid \theta)$ stands for $\Pr(h \mid \pi_\theta)$. We assume that policy values (and returns) are nonnegative and bounded by $V_{\max}$. The objective of the agent is to find a policy with optimal value: $\theta^* = \arg\max_\theta V(\theta)$. Because the agent does not have a model of the environment’s dynamics or reward function, it cannot calculate $V(\theta)$ and must estimate it via sampling.
2.2 Sampling
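If we wish to estimate the value $V(\theta)$ of the policy $\pi_\theta$, we may draw sample histories from the distribution induced by this policy by executing the policy multiple times in the environment. After taking $N$ samples $h_1, \ldots, h_N$ we can use the unbiased estimator:
$$\hat{V}(\theta) = \frac{1}{N} \sum_{i=1}^{N} R(h_i).$$
Imagine, however, that we are unable to sample from the policy $\pi_\theta$ directly, but instead have samples from another policy $\pi_{\theta'}$. The intuition is that if we knew how “similar” those two policies were to one another, we could use samples drawn according to the distribution $\Pr(h \mid \theta')$ and make an adjustment proportional to the similarity of the policies. Formally we have:
$$V(\theta) = \sum_{h} R(h) \Pr(h \mid \theta) = \sum_{h} R(h) \frac{\Pr(h \mid \theta)}{\Pr(h \mid \theta')} \Pr(h \mid \theta') = E_{\theta'}\left[ R(h) \frac{\Pr(h \mid \theta)}{\Pr(h \mid \theta')} \right].$$
Now we can construct an unbiased indirect estimator from samples drawn according to the distribution $\Pr(h \mid \theta')$, which is called an importance sampling estimator (Rubinstein81) of $V(\theta)$:
$$\hat{V}_{IS}(\theta) = \frac{1}{N} \sum_{i=1}^{N} R(h_i) \frac{\Pr(h_i \mid \theta)}{\Pr(h_i \mid \theta')}.$$
We can normalize the importance sampling estimate to obtain a lower variance estimate at the cost of adding bias. Such an estimator is called a weighted importance sampling estimator and has the form
$$\hat{V}_{WIS}(\theta) = \frac{\sum_{i=1}^{N} R(h_i) \frac{\Pr(h_i \mid \theta)}{\Pr(h_i \mid \theta')}}{\sum_{i=1}^{N} \frac{\Pr(h_i \mid \theta)}{\Pr(h_i \mid \theta')}},$$
which has been found to be better-behaved than $\hat{V}_{IS}$ both theoretically and empirically (MeuleauTR00; Shelton01; PrecupICML00).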
Note that both estimators contain the quantity $\Pr(h \mid \theta) / \Pr(h \mid \theta')$, a ratio of likelihoods. The key observation for the remainder of this paper is that while an agent is not assumed to have a model of the environment and therefore is not able to calculate $\Pr(h \mid \theta)$, it is able to calculate the likelihood ratio for any two policies $\theta$ and $\theta'$ (SutBar98; MeuleauTR00; PeshkinPhD). $\Pr(h \mid \theta)$ can be written as a product of $\Psi(h)$ and $\Phi(\theta, h)$, where $\Phi(\theta, h)$ is the contribution of all of the agent’s actions to the likelihood of the history, and $\Psi(h)$ is the contribution of environmental events. Because the component $\Psi(h)$ is independent of the policy (i.e. it does not depend on the policy parameter, only on the history and the pomdp distributions), it cancels from the ratio, and we have $\Pr(h \mid \theta) / \Pr(h \mid \theta') = \Phi(\theta, h) / \Phi(\theta', h)$. $\Phi(\theta, h)$ depends only on the agent, the observations, and the actions (not the states), is known to the agent, and can be computed and differentiated. This allows us to construct more efficient learning algorithms that can take advantage of past experience.
Finally, if the sampling distribution is not constant (i.e. each sample is drawn from a different distribution), a single unbiased importance sampling estimator can be constructed by using all of the samples, where the assumed single sampling distribution is the mixture of the true sampling distributions. Thus, if samples $h_1, \ldots, h_N$ were taken according to policies $\theta_1, \ldots, \theta_N$, the sampling likelihood $\Pr(h_i \mid \theta')$ from above is replaced with $\frac{1}{N} \sum_{j=1}^{N} \Pr(h_i \mid \theta_j)$. Shelton01 gives more details for importance sampling estimators with independent, but not identically drawn, samples. Using this new estimator allows us to change policies (sampling distributions) during sampling.
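As a minimal sketch of the two estimators with a mixture sampling distribution, assume the policy factor of every stored history is already available as a number; the environment's contribution cancels in the ratio, so only these factors are needed. The function names and array layout below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def importance_weights(phi_target, phi_sampling):
    """Likelihood ratios against the mixture of sampling policies.

    phi_target[i]      : policy factor of history i under the evaluated policy
    phi_sampling[i][j] : policy factor of history i under the j-th sampling policy
    (both layouts are assumptions made for this sketch)
    """
    phi_target = np.asarray(phi_target, dtype=float)
    phi_sampling = np.asarray(phi_sampling, dtype=float)
    mixture = phi_sampling.mean(axis=1)  # uniform mixture over sampling policies
    return phi_target / mixture

def is_estimate(returns, weights):
    """Unnormalized importance sampling estimate: mean of weighted returns."""
    returns = np.asarray(returns, dtype=float)
    return float(np.mean(returns * weights))

def wis_estimate(returns, weights):
    """Weighted importance sampling estimate: weights are normalized to sum to one."""
    returns = np.asarray(returns, dtype=float)
    return float(np.sum(returns * weights) / np.sum(weights))
```

The only difference between the two functions is the denominator: the number of samples for the unnormalized estimate, the sum of the weights for the normalized one.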
3 Algorithms
Consider constructing a proxy environment that contains a nonparametric model of the values of all policies, as illustrated by Figure 1. This model is the result of trying several policies $\theta_1, \ldots, \theta_N$. Given an arbitrary new policy $\theta$, the proxy environment returns an estimate of its value as if the policy were tried in the real environment. Assuming that obtaining a sample from the environment is costly, we want to construct the proxy module based only on a small number of queries, each of which tries one policy and returns the experience gathered with it. These queries are implemented by the sample routine (Table 1). Storing the data after $N$ samples requires memory proportional to the number of samples and to the length $T$ of a trial, with a per-step cost that depends on the sizes of the sets $O$ and $A$ of possible observations and actions respectively. However, for many policy classes this memory requirement can be reduced. For example, if the policy class is reactive (conditioned on the current observation, the probability of the current action has no dependence on the past), the history can be summarized sufficiently by the counts of the number of times each action was chosen after each observation. This requires storing only one count per observation-action pair for each sample.
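The count-based summary for reactive policies can be sketched as follows. This assumes a history is stored as a list of (observation, action, reward) triples and that the return is the plain sum of rewards; both are assumptions made for illustration, and `summarize_reactive` is a hypothetical name.

```python
from collections import Counter

def summarize_reactive(history):
    """Compress a history into observation-action counts plus its return.

    history: list of (observation, action, reward) triples (assumed layout).
    The return is taken to be the sum of rewards, one common choice.
    """
    counts = Counter((obs, act) for obs, act, _ in history)
    total_return = sum(reward for _, _, reward in history)
    return counts, total_return
```

For a reactive policy the counts and the return are all that a likelihood-ratio estimator needs from the history, so the order of events can be discarded.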
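Table 1: The sample routine.
Input: policy $\theta$
Init: $h \leftarrow \langle\rangle$
Get initial observation $o_0$.
For each time step $t$ of the trial:
    Draw next $a_t$ from $\Pr(a \mid h, \theta)$
    Execute $a_t$.
    Get observation $o_{t+1}$, reward $r_t$.
Output: experience $h$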
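The sample routine can be sketched in Python under a few stated assumptions: the environment exposes hypothetical `reset()` and `step(action)` methods, the policy is a callable that draws an action, and a trial has a fixed horizon. The toy environment `TwoStateEnv` exists only to make the sketch runnable and is not one of the paper's domains.

```python
import random

class TwoStateEnv:
    """Hypothetical toy environment: the observation is the state itself,
    action 1 toggles the state, and the reward is 1 whenever the state is 1."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1:
            self.state = 1 - self.state
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

def sample(policy, env, horizon, rng):
    """Run one trial and return the experience as (observation, action, reward) triples."""
    history = []
    obs = env.reset()
    for _ in range(horizon):
        action = policy(obs, history, rng)  # the policy may look at the whole history
        next_obs, reward = env.step(action)
        history.append((obs, action, reward))
        obs = next_obs
    return history

def coin_policy(obs, history, rng):
    """Reactive stochastic policy that ignores its inputs and flips a fair coin."""
    return 1 if rng.random() < 0.5 else 0
```

Passing the random generator explicitly keeps trials reproducible, which is convenient when the same data set is reused to evaluate many policies.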
Input: policy , data  
Init: , ,  
For to :  
Init:  
For to :  
For to :  
For to :  
Output: proxy evaluation and derivative 
This proxy can be queried by the learning algorithm as shown in table 2. In response to a policy parameter setting, the routine evaluate returns its estimate of the expected return and its derivative. The algorithm shown in table 2 computes the weighted importance sampling estimate. For simplicity, the inner loop over the sampling policies is shown explicitly. In practice the computations in this loop do not need to be redone for every evaluation: they do not depend on the policy being evaluated, so their values can be cached, one per sample, and computed ahead of time (in constant time per sample), which reduces an evaluation to a single pass over the samples.
The evaluate routine relies on two other routines: one to calculate $\Phi(\theta, h)$ and one to calculate the derivative of $\Phi(\theta, h)$. Recall that $\Phi(\theta, h)$ is the policy’s factor of the probability of the history $h$. As an example, if we assume the policy to be reactive, the parameter $\theta_{oa}$ to be the probability of selecting action $a$ after observing $o$, and $c_{oa}$ to be the count of the number of times action $a$ was chosen after observing $o$ during the history,
$$\Phi(\theta, h) = \prod_{o, a} \theta_{oa}^{c_{oa}}, \qquad \frac{\partial \Phi(\theta, h)}{\partial \theta_{oa}} = \frac{c_{oa}}{\theta_{oa}} \Phi(\theta, h).$$
MeuleauTR00 and PeshkinPhD describe how to compute these quantities for reactive policies with Boltzmann distributions, and Shelton01 describes how to compute these quantities for finite-state controllers.
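For the reactive example, the policy factor is the product, over observation-action pairs, of the action probability raised to the number of times that pair occurred. A minimal sketch follows, with the hypothetical names `phi` and `dphi` and with each probability treated as a free parameter (the constraint that probabilities sum to one per observation is ignored here).

```python
import math

def phi(theta, counts):
    """Policy factor of a history for a reactive policy.

    theta[(o, a)]  : probability of choosing action a after observing o
    counts[(o, a)] : number of times a was chosen after o in the history
    """
    return math.prod(theta[oa] ** c for oa, c in counts.items())

def dphi(theta, counts):
    """Partial derivatives of phi with respect to each theta[(o, a)] that occurs
    in the history. Pairs that never occur have derivative zero and are omitted.
    Requires theta[(o, a)] > 0, matching the bounded-away-from-zero assumption."""
    value = phi(theta, counts)
    return {oa: c / theta[oa] * value for oa, c in counts.items()}
```

Working with the logarithm of the factor is usually safer numerically for long histories; the direct product is kept here because it mirrors the definition.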
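Table 3: The learn routine.
Input: number of samples/trials $N$
Init: $\theta \leftarrow$ initial policy, $D \leftarrow \emptyset$
For $i = 1$ to $N$:
    $\theta' \leftarrow$ pick_sample($\theta$, $D$)
    $h \leftarrow$ sample($\theta'$)
    $D \leftarrow$ add_data($D$, $\langle h, \theta' \rangle$)
    $\theta \leftarrow$ optimize(evaluate, $D$, $\theta$)
Output: hypothetical optimal policy $\theta$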
Any policy search algorithm can now be combined with this proxy environment to learn from scarce experience. Table 3 shows a general reinforcement learning algorithm family using the proxy. The definitions of pick_sample, add_data, and optimize are crucial to the behavior of the algorithm. The reinforce algorithm (Williams92) is one particular instantiation of the learn routine where pick_sample returns the current policy $\theta$ without consulting the data, add_data forgets all of the previous data and replaces it with the most recent sample, and optimize performs one step of gradient descent (using the unnormalized importance sampling estimator instead of the weighted one). The exploration extension to reinforce proposed by MeuleauTR00 is exactly the same except that the pick_sample routine now returns a policy that is a mixture of $\theta$ and a random policy.
In order to make effective use of all of the data, we define add_data to append the new data sample to the collection of data. This allows our algorithm to remember all previous experience. Additionally, we use an optimize routine that performs full optimization (not just a single step). In reinforce and other policy search methods, the current policy guess embodies all of the known information about the past (forgotten) samples. It is therefore important to take only small steps of decreasing size to ensure the algorithm converges. Because we now remember all of the previous samples and we do not have any constraint on which policy we must use for the next sample, we can search for the true optimum of the estimator at every step.
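The family of algorithms described above can be sketched as one loop with pluggable parts. The signatures below are assumptions chosen for illustration: each routine is passed in as a plain Python callable, and the data set is a list of (policy, history) pairs.

```python
def learn(n_trials, theta0, pick_sample, sample, add_data, optimize):
    """Generic policy-search loop around a proxy built from stored samples.

    pick_sample(theta, data) -> policy to try next
    sample(theta)            -> one history gathered with that policy
    add_data(data, item)     -> updated data set
    optimize(theta, data)    -> new policy guess
    (all four signatures are hypothetical)
    """
    theta, data = theta0, []
    for _ in range(n_trials):
        theta_try = pick_sample(theta, data)
        history = sample(theta_try)
        data = add_data(data, (theta_try, history))
        theta = optimize(theta, data)
    return theta

def remember_all(data, item):
    """add_data variant that keeps every sample, as proposed in the text."""
    return data + [item]

def keep_latest(data, item):
    """add_data variant that forgets the past, as in a reinforce-style update."""
    return [item]
```

Swapping `remember_all` for `keep_latest`, and a full optimizer for a single gradient step, moves the same loop between the data-reusing algorithm and a reinforce-style one.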
4 How to Sample?
We are still left with a choice for the routine pick_sample. This routine represents our balance between exploration and exploitation. For this paper, we consider a simple possibility to illustrate this tradeoff. We let the pick_sample routine have a single parameter, the exploitation probability. pick_sample is stochastic: with this probability it returns the current policy estimate produced by optimize; the remainder of the time it returns a random policy chosen uniformly over the policy space. Thus, the larger the value of the exploitation probability, the more exploitative the algorithm is.
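A minimal sketch of such a pick_sample routine, assuming for illustration that a policy is a list of parameters and that the policy space is the unit hypercube (the paper's policy space may be constrained differently):

```python
import random

def pick_sample(theta_best, exploit_prob, dim, rng):
    """Return the current best policy with probability exploit_prob,
    otherwise a policy drawn uniformly from [0, 1)^dim (assumed policy space)."""
    if rng.random() < exploit_prob:
        return theta_best
    return [rng.random() for _ in range(dim)]
```

With `exploit_prob` equal to 1 the routine always reuses the current optimum, and with 0 it samples blindly, which are the two extremes the text contrasts.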
4.1 Illustration: Bandit Problems
Let us consider a trivial example of a bandit problem to illustrate the importance of exploitation and exploration. The environment has a degenerate state space of one state, in which two actions, and , are available. The space of policies available is stochastic and encoded with one parameter, the probability of taking the first action, which is constrained to be in the interval . We consider two problems, called “HT” (hidden treasure) and “HF” (hidden failure) both of which have the same expected returns for actions: for and for . In HF, always returns , while returns with probability and with probability . In HT, always returns , while returns with probability and with probability . We would expect a greedy learning algorithm to sample near policies that look better under scarce information, tending to choose the suboptimal in the HT problem. This strategy is inferior to blind sampling, which samples uniformly from the policy space and will discover the hidden treasure of faster. By contrast, for the HF problem we would expect the greedy algorithm to do better by initially concentrating on the , which looks better, and discovering the hidden failure sooner than blind sampling.
We ran the learn algorithm from table 3 for different settings of the parameters and . Figure 2 shows the true value of the resulting policies, averaged over runs of the algorithm. While the plots may look discouraging, remember that these problems are in some ways a worst-case situation. The true value of the actions only becomes apparent after sampling on the order of times. The plots support our hypothesis about the relative success of exploitation. However, although acting greedily is somewhat better in HF, it is much worse in HT. This illustrates why, without any prior knowledge of the domain and given a limited number of samples, it is important not to guide sampling too much by optimization.
5 How Much to Sample?
Algorithm  Lower bound on sample complexity 

likelihood ratio  
reusable trajectories 
If we wish to guarantee with probability that the error in the estimate of the value function is less than , we can derive bounds on the necessary sample size , which depend on , , , and the complexity of the hypothesis class expressed by the covering number . Our new result is an extension of the sample complexity bound for the is estimator (Peshkin01) to the wis estimator. We only quote the results here. The key point in the derivation is the fact that
where differs from only by one member trajectory . Two inequalities follow from this fact. Denote . The variance of the wis estimator according to Devroye’s theorem is bounded as
and McDiarmid’s theorem (McDiarmid89) gives us a pac bound (for the derivation see PeshkinPhD):
which gives a sample complexity bound very similar to those obtained by KearnsTR99 (see next section). It is well known (Rubinstein81) that the variance of the wis estimate is . The weak dependence on the horizon is interesting and in accordance with empirical findings. The covering number is defined through the value and describes the complexity of a policy class (e.g. reactive policies or finite state controllers) with respect to the structure of a reward function.
5.1 Comparison to vc Bound
The pioneering work by KearnsNIPS99 considers the issue of generating enough information to determine the near-best policy. We compare our sample complexity results from above with a similar result for their “reusable trajectories” algorithm. Using a random policy (selecting actions uniformly at random), reusable trajectories generates a set of history trees. This information is used to define estimates that uniformly converge to the true values. The algorithm relies on having a generative model of the environment, which allows simulation of a reset of the environment to any state and the execution of any action to sample an immediate reward. The reuse of information is partial: the estimate of a policy value is built only on the subset of experiences that are “consistent” with the estimated policy.
We will make a comparison based on a sampling policy that selects one of two actions uniformly at random: . For the horizon , this gives us an upper bound on the likelihood ratio:
Substituting this expression for , we can compare our bounds to the bound of KearnsNIPS99 as presented in table 4. The metric entropy takes the place of the VC dimension in terms of policy class complexity. Metric entropy is a more refined measure of capacity than VC dimension; the VC dimension is an upper bound on the growth function which is an upper bound on the metric entropy (Vapnik98).
5.2 Illustration: Load-Unload Problem
The complexity of the problem is measured by the covering number, . It encodes the complexity of the combination of the pomdp and the policy class. We use the load-unload problem of figure 3 to illustrate the effect of policy complexity. The agent is a cart designed to shuttle loads back and forth between two endpoints on a line. The cart does not have sensors to indicate whether it is loaded or unloaded, but it can determine its position on the line. The optimal policy is one where the cart moves back and forth between the leftmost and rightmost states moving as many loads as possible. To do so requires some form of memory. For our example, we will use finite-state controllers with fixed memory sizes (SheltonPhD; PeshkinPhD).
A finite-state controller is a class of policies with a fixed memory size. The controller has its own internal memory state that is restricted to one of a finite number of values. At each time step, it not only selects which action to take, but also a memory state for the next time step. The controller’s choice of action and next memory state are independent of the past given the current observation and memory state. This model is an extension of the reactive policy class to allow the controller to remember a small amount about the past. Finite-state controllers have the capability of remembering information for an arbitrarily long period of time.
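One step of a finite-state controller can be sketched as two draws conditioned on the current observation and memory state. The table layout and the name `fsc_step` are assumptions for illustration, and drawing the action and the next memory state independently of each other is a simplification that the text does not require.

```python
import random

def fsc_step(action_probs, memory_probs, obs, memory, rng):
    """Draw an action and a next memory state for a finite-state controller.

    action_probs[(obs, memory)] : dict mapping action -> probability
    memory_probs[(obs, memory)] : dict mapping next memory state -> probability
    (both tables are hypothetical parameterizations)
    """
    def draw(dist):
        outcomes = list(dist)
        weights = [dist[x] for x in outcomes]
        return rng.choices(outcomes, weights=weights)[0]

    action = draw(action_probs[(obs, memory)])
    next_memory = draw(memory_probs[(obs, memory)])
    return action, next_memory
```

A reactive policy is the special case with a single memory state, in which the second draw is trivial.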
Figure 4 demonstrates the effect of policy complexity on the performance of the algorithm. This plot is the same as the ones in figure 2 except that the exploitation probability, , has been fixed at . The four lines depict results for different policy classes, . The solid line is for reactive policies (policies with no memory) whereas the dashed and dotted lines are for finite-state controllers with varying amounts of memory. Only one bit of memory is required to perform optimally in this environment. Using more than two states of memory is superfluous. We can see that the simpler the policy class, the more quickly the algorithm converges. However, with too simple a policy class (i.e. reactive policies), the convergence is to a suboptimal policy. For comparison, the thin dashed line presents the behavior of reinforce with a controller that has two internal states. As we have seen, reinforce forgets past experience, so its performance improves only very slowly as the amount of experience grows.
6 Discussion
Likelihood-ratio estimation seems to show promise in using data efficiently. The pick_sample routine we present here is only one (simplistic) method of balancing exploitation and exploration. More sophisticated methods, including maintaining a distribution over the space of policies, might allow for a better balance and the possibility of learning a useful sampling bias in a policy space for a particular application domain and transferring it from one learning problem in that domain to another. In general, estimating the variance of the proxy evaluator could aid in selecting new samples for either exploration or exploitation.
Where reinforce keeps only the most recent sample, our algorithm keeps all of the samples. If a large amount of data is collected, it may be necessary to employ a method between these two extremes and remember a representative set of the samples. Deciding which samples to “forget” would be a difficult, but crucial, task.
Sample complexity bounds depend on the covering number as a measure of the complexity of the policy space. Estimating the covering number is a challenging problem in itself. However it would be more desirable to find a constructive solution to a covering problem in a sense of universal prediction theory (Merhav98; Shtarkov87). Obviously, given a covering number there might be several ways to cover the space. Finding a covering set would be equivalent to reducing a global optimization problem to an evaluation of several representative policies.
Another way to use sample complexity results is to find the minimal experience necessary to provide the estimate for any policy in the class with a given confidence. This would be similar to the structural risk minimization principle of Vapnik (Vapnik98). The intuition is that given very limited data, one might prefer to search a primitive class of hypotheses with high confidence, rather than to get lost in a sophisticated class of hypotheses due to low confidence.
Acknowledgments
The authors would like to thank Leslie Kaelbling for helpful discussions and comments on the manuscript. C.S. was supported by grants from ONR contracts Nos. N000149313085 & N000149510600, and NSF contracts Nos. IIS9800032 & DMS9872936 while at MIT and by ONR contract N000140010637 under the MURI program “Decision Making under Uncertainty” while at Stanford. L.P. was supported by ARO grant #DAA190110601.
References
[Baird99] Baird, L. C., & Moore, A. W. (1999). Gradient descent for general reinforcement learning. Advances in Neural Information Processing Systems (pp. 968–974).
[BaxterICML00] Baxter, J., & Bartlett, P. (2000). Reinforcement learning in POMDP’s via direct gradient ascent. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 41–48).
[Glynn86] Glynn, P. W. (1986). Optimization of stochastic systems. Proceedings of the Winter Simulation Conference.
[Glynn89] Glynn, P. W. (1989). Importance sampling for stochastic simulations. Management Science, 35, 1367–1392.
[Glynn90] Glynn, P. W. (1990). Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33, 75–84.
[Glynn94] Glynn, P. W. (1994). Importance sampling for Markov chains: asymptotics for the variance. Communications in Statistics. Stochastic Models, 10, 701–717.
[KearnsNIPS99] Kearns, M., Mansour, Y., & Ng, A. Y. (1999a). Approximate planning in large POMDPs via reusable trajectories. Advances in Neural Information Processing Systems (pp. 1001–1007).
[KearnsTR99] Kearns, M., Mansour, Y., & Ng, A. Y. (1999b). Approximate planning in large POMDPs via reusable trajectories (Technical Report). AT&T.
[Marbach] Marbach, P. (1998). Simulation-based methods for Markov decision processes. Doctoral dissertation, MIT.
[McDiarmid89] McDiarmid, C. (1989). Surveys in combinatorics, chapter On the method of bounded differences, 148–188. Cambridge University Press.
[Merhav98] Merhav, N., & Feder, M. (1998). Universal prediction. IEEE Transactions on Information Theory, 44, 2124–2147. Invited paper.
[MeuleauTR00] Meuleau, N., Peshkin, L., & Kim, K.-E. (2000). Exploration in gradient-based reinforcement learning (Technical Report AIMEMO1713). MIT.
[MeuleauUAI99] Meuleau, N., Peshkin, L., Kim, K.-E., & Kaelbling, L. P. (1999). Learning finite-state controllers for partially observable environments. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 427–436). Morgan Kaufmann.
[Glynn00] Ormoneit, D., & Glynn, P. W. (2000). Kernel-based reinforcement learning in average-cost problems: An application to optimal portfolio choice. Advances in Neural Information Processing Systems (pp. 1068–1074).
[PeshkinPhD] Peshkin, L. (2001). Architectures for policy search. Doctoral dissertation, Brown University, Providence, RI 02912.
[Peshkin00] Peshkin, L., Kim, K.-E., Meuleau, N., & Kaelbling, L. P. (2000). Learning to cooperate via policy search. Sixteenth Conference on Uncertainty in Artificial Intelligence (pp. 307–314). San Francisco, CA: Morgan Kaufmann.
[Peshkin99] Peshkin, L., Meuleau, N., & Kaelbling, L. P. (1999). Learning policies with external memory. Proceedings of the Sixteenth International Conference on Machine Learning (pp. 307–314). San Francisco, CA: Morgan Kaufmann.
[Peshkin01] Peshkin, L., & Mukherjee, S. (2001). Bounds on sample size for policy evaluation in Markov environments. Proceedings of the Fourteenth Annual Conference on Computational Learning Theory (pp. 608–615).
[PrecupICML00] Precup, D., Sutton, R. S., & Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 759–766).
[Rubinstein81] Rubinstein, R. Y. (1981). Simulation and the Monte Carlo method. New York, NY: Wiley.
[SheltonPhD] Shelton, C. R. (2001a). Importance sampling for reinforcement learning with multiple objectives. Doctoral dissertation, MIT.
[Shelton01] Shelton, C. R. (2001b). Policy improvement for POMDPs using normalized importance sampling. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (pp. 496–503).
[Shtarkov87] Shtarkov, Y. M. (1987). Universal sequential coding of single measures. Problems of Information Transmission, 175–185.
[SutBar98] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. The MIT Press.
[SuttonNIPS99] Sutton, R. S., McAllester, D., Singh, S. P., & Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems (pp. 1057–1063).
[Csaba] Szepesvari, C. (1999). Personal communication.
[Vapnik98] Vapnik, V. (1998). Statistical learning theory. Wiley.
[Williams91] Williams, R., & Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3, 241–268.
[Williams92] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.