1 Introduction
The goal of Reinforcement Learning (RL), and more generally of sequential decision making, is to learn an optimal policy for performing a certain task within an environment, modeled by a Markov Decision Process (MDP). In RL, the dynamics of the environment are considered unknown. This means that to obtain an optimal policy, the learner interacts with its environment, observing the outcomes of the actions it performs. Though well understood from a theoretical point of view, RL still faces many practical issues related to the complexity of the environment, in particular when dealing with large state or action sets. Currently, using function approximation to better represent and generalize over the environment is a common approach for dealing with large state sets. However, learning with large action sets has been less explored and remains a key challenge.
When the number of possible actions is neither on the scale of ‘a few’ nor outright continuous, the situation becomes difficult. In particular cases where the action space is continuous (or nearly so), a regularity assumption can be made on the consequences of the actions concerning either a certain smoothness or Lipschitz property over the action space Lazaric et al. (2007); Bubeck et al. (2011); Negoescu et al. (2011). However, situations abound in which the set of actions is discrete, but the number of actions lies somewhere between 10 and (or greater); Go, Chess, and planning problems are among these. In the common case where the action space shows no regularity, it is not possible to gain knowledge regarding the consequence of an action that has never been applied, so subsampling is not an option.
In this article, we present an algorithm which can intelligently subsample even completely irregular action spaces. Drawing from ideas used in multiclass supervised learning, we introduce a novel way to significantly reduce the complexity of learning (and acting) with large action sets. By assigning a multi-bit code to each action, we create binary clusters of actions through the use of Error-Correcting Output Codes (ECOCs) Dietterich & Bakiri (1995). Our approach is anchored in Rollout Classification Policy Iteration (RCPI) Lagoudakis & Parr (2003), an algorithm well known for its efficiency on real-world problems. We begin by proposing a simple way to reduce the computational cost of any policy by leveraging the clusters of actions defined by the ECOCs. We then extend this idea to the problem of learning, and propose a new RL method that allows one to find an approximately optimal policy by solving a set of 2-action MDPs. While our first model, ECOC-extended RCPI (ERCPI), reduces the overall learning complexity from $O(A^2)$ to $O(A \log A)$, our second method, Binary-RCPI (BRCPI), reduces this complexity even further, to $O(\log A)$.
The paper is organized as follows: We give a brief overview of notation and RL in Section 2.1, then introduce RCPI and ECOCs in Sections 2.2 and 2.3 respectively. We present the general idea of our work in Section 3. We show how RCPI can be extended using ECOCs in Section 3.2, and then explain in detail how an MDP can be factorized to accelerate RCPI during the learning phase in Section 3.3. An in-depth complexity analysis of the different algorithms is given in Section 3.4. Experimental results are provided on two problems in Section 4. Related work is presented in Section 5.
2 Background
In this section, we cover the three key elements to understanding our work: Markov Decision Processes, Rollout Classification Policy Iteration, and Error-Correcting Output Codes.
2.1 Markov Decision Process
Let a Markov Decision Process $\mathcal{M}$ be defined by a 4-tuple $(\mathcal{S}, \mathcal{A}, T, R)$.

$\mathcal{S}$ is the set of possible states of the MDP, where $s \in \mathcal{S}$ denotes one state of the MDP.

$\mathcal{A}$ is the set of possible actions, where $a \in \mathcal{A}$ denotes one action of the MDP.

$T$ is the MDP’s transition function, and defines the probability of going from state $s$ to state $s'$ having chosen action $a$: $T(s', s, a) = P(s' \mid s, a)$.

$R$ is a reward function defining the expected immediate reward $R(s, a)$ of taking action $a$ in state $s$. The actual immediate reward for a particular transition is denoted by $r$.
In this article, we assume that the set of possible actions is the same for all states, but our work is not restricted to this situation; the set of actions can vary with the state without any drawbacks.
Let us define a policy, $\pi : \mathcal{S} \to \mathcal{A}$, providing a mapping from states to actions in the MDP. In this paper, without loss of generality, we consider that the objective to fulfill is the optimization of the expected sum of discounted rewards (with discount factor $\gamma$) from a given set of states $\mathcal{S}_0$: $J(\pi) = E\left[\sum_{t \geq 0} \gamma^t r_t \mid s_0 \in \mathcal{S}_0, \pi\right]$.
A policy’s performance is measured w.r.t. the objective function $J$. The goal of RL is to find an optimal policy $\pi^*$ that maximizes the objective function: $\pi^* = \arg\max_{\pi} J(\pi)$.
In an RL problem, the agent knows both $\mathcal{S}$ and $\mathcal{A}$, but is not given the environment’s dynamics defined by $T$ and $R$. In the case of our problems, we assume that the agent may start from any state in the MDP, and can run as many simulations as necessary until it has learned a good policy.
2.2 Rollout Classification Policy Iteration
We anchor our contribution in the framework provided by RCPI Lagoudakis & Parr (2003). RCPI belongs to the family of Approximate Policy Iteration (API) algorithms, iteratively improving estimates of the $Q$-function, $Q_\pi(s, a)$. In general, API uses a policy $\pi$ to estimate $Q_\pi$ through simulation, and then approximates it by some form of regression on the estimated values, providing $\tilde{Q}_\pi$. This is done first with an initial (and often random) policy $\pi_0$, and is iteratively repeated until $\tilde{Q}_\pi$ is properly estimated. $\tilde{Q}_\pi(s, a)$ is estimated by running $K$ rollouts, i.e. Monte-Carlo simulations using $\pi$ to estimate the expected reward. The new policy $\pi'$ is thus the policy that chooses the action with the highest $\tilde{Q}_\pi$-value for each state.
In the case of RCPI, instead of using a function approximator to estimate $\tilde{Q}_\pi$, the best action for a given $s$ is selected using a classifier, without explicitly approximating the Q-value. This estimation is usually done using a binary classifier $f_\theta$ over the state-action space such that the new policy can be written as:
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} f_\theta(s, a) \qquad (1)$$
The classifier’s training set is generated through Monte-Carlo sampling of the MDP, estimating the optimal action for each state sampled. Once generated, these optimal state-action pairs $(s, a)$ are used to train a supervised classifier; the state is interpreted as the feature vector, and the action $a$ as the state’s label. In other words, RCPI is an API algorithm that uses Monte-Carlo simulations to transform the RL problem into a multiclass classification problem.
2.3 Error-Correcting Output Codes
In the domain of multiclass supervised classification in large label spaces, ECOCs have long been in use Dietterich & Bakiri (1995). We cover ECOCs only briefly here, as their adaptation to an MDP formalism is detailed in Section 3.2.
Given a multiclass classification task with a label set $\mathcal{Y}$, the class labels can be encoded as binary integers using as few as $\log_2(|\mathcal{Y}|)$ bits. ECOCs for classification assume that each label is associated to a binary code of length $C$, with $C \geq \log_2(|\mathcal{Y}|)$. (Footnote 3: Different methods exist for generating such codes. In practice, it is customary to use redundant codes, i.e. codes longer than this minimum.)
The main principle of multiclass classifiers with ECOCs is to learn to predict the output code instead of directly predicting the label, transforming a supervised learning problem with $|\mathcal{Y}|$ classes into a set of $C$ binary supervised learning problems. Once trained, the class of a datum $x$ can be inferred by passing the datum to all the classifiers and concatenating their outputs into a predicted label code: $(f_1(x), \ldots, f_C(x))$. The predicted label is thus the label with the closest code in terms of Hamming distance. As a side note, Hamming distance lookups can be done in logarithmic time by using tree-based approaches such as k-d trees Bentley (1975). ECOCs for classification can thus infer with a complexity of $O(\log(|\mathcal{Y}|))$.
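The decoding step described above can be sketched as follows. This is a minimal illustration rather than the procedure used in the experiments: the function names, the code length, and the use of NumPy are assumptions, and the lookup is a plain linear scan over all codes rather than the tree-based logarithmic lookup mentioned in the text.

```python
import numpy as np

def random_coding_matrix(n_labels, n_bits, seed=0):
    """Draw a random binary coding matrix with one distinct row per label (hypothetical helper)."""
    if n_labels > 2 ** n_bits:
        raise ValueError("n_bits is too small to give every label a distinct code")
    rng = np.random.default_rng(seed)
    while True:
        codes = rng.integers(0, 2, size=(n_labels, n_bits))
        # Redraw until all rows differ, so that decoding is unambiguous.
        if len(np.unique(codes, axis=0)) == n_labels:
            return codes

def decode(codes, predicted_bits):
    """Return the label whose code is closest to the predicted bits in Hamming distance."""
    distances = np.sum(codes != np.asarray(predicted_bits), axis=1)
    return int(np.argmin(distances))

# 8 labels encoded on 12 bits (more than the 3 bits strictly needed).
codes = random_coding_matrix(n_labels=8, n_bits=12)
```

In a full classifier, `predicted_bits` would be the concatenated outputs of the binary classifiers; here, passing a label's own code back to `decode` recovers that label.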
3 Extended & Binary RCPI
We separate this paper’s contributions into two parts, the second building on the first. We begin by showing how ECOCs can be easily integrated into a classifier-based policy, and proceed to show how the ECOC’s coding matrix can be used to factorize RCPI into a much less complex learning algorithm.
3.1 General Idea
The general idea of our two algorithms revolves around the use of ECOCs for representing the set of possible actions, $\mathcal{A}$. This approach assigns a multi-bit code of length $C$ to each of the $A = |\mathcal{A}|$ actions. The codes are organized in a coding matrix, illustrated in Figure 1 and denoted $M^c$. Each row corresponds to one action’s binary code, while each column is a particular dichotomy of the action space corresponding to that column’s associated bit $b_i$. In effect, each column is a projection of the $A$-dimensional action space into a 2-dimensional binary space. We denote $M^c_{[a,*]}$ as the row of $M^c$ which corresponds to $a$’s binary code. $M^c_{[a,i]}$ corresponds to bit $b_i$ of action $a$’s binary code.
Our main idea is to consider that each bit $b_i$ corresponds to a binary sub-policy denoted $\pi_i$. By combining these sub-policies, we can derive the original policy $\pi$ one wants to learn as such:
$$\pi(s) = \arg\min_{a \in \mathcal{A}} d_H\left(M^c_{[a,*]}, (\pi_1(s), \ldots, \pi_C(s))\right) \qquad (2)$$
where $M^c_{[a,*]}$ is the binary code for action $a$, and $d_H$ is the Hamming distance. For a given state $s$, each sub-policy provides a binary action $\pi_i(s)$, thus producing a binary vector of length $C$. $\pi(s)$ chooses the action with the binary code that has the smallest Hamming distance to the concatenated output of the binary policies.
We propose two variants of RCPI that differ in the way they learn these sub-policies. ECOC-extended RCPI (ERCPI) replaces the standard definition of $\pi$ by the definition in Eq. (2), both for learning and action selection. The Binary-RCPI method (BRCPI) relaxes the learning problem and considers that all the sub-policies can be learned independently on separate binary-actioned MDPs, resulting in a very rapid learning algorithm.
3.2 ECOCExtended RCPI
ERCPI takes advantage of the policy definition in Equation (2) to decrease RCPI’s complexity. The $C$ sub-policies $\pi_1, \ldots, \pi_C$ are learned simultaneously on the original MDP, by extending the RCPI algorithm with an ECOC-encoding step, as described in Algorithm 1. Like any policy improvement algorithm, ERCPI iteratively performs the following two steps:
Simulation Step: This consists in performing Monte Carlo simulations to estimate the quality of a set of state-action pairs. From these simulations, a set of training examples $S_T$ is generated, in which data are states, and labels are the estimated best action for each state.
Learning Step: For each bit $b_i$, $S_T$ is used to create a binary label training set $S_T^i$. Each $S_T^i$ is then used to train a classifier $f_i$, providing sub-policy $\pi_i$ as in Eq. (1). Finally, the set of sub-policies is combined to provide the final improved policy as in Eq. (2).
ERCPI’s training algorithm is presented in Alg. 1.
The Rollout function used by ERCPI is identical to the one used by RCPI: $\pi$ is used to estimate a certain state-action tuple’s expected reward, $\tilde{Q}_\pi(s, a)$.
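A rollout estimate of this kind can be sketched as below. This is an illustrative reading of the description, not the paper's Rollout function: the `step` interface (returning next state, reward, and a termination flag), the parameter names, and the toy chain environment are all assumptions made for the example.

```python
def rollout(step, policy, state, action, horizon, gamma, n_rollouts):
    """Average discounted return of taking `action` in `state`, then following `policy`."""
    total = 0.0
    for _ in range(n_rollouts):
        s, a, ret, discount = state, action, 0.0, 1.0
        for _ in range(horizon):
            s, reward, done = step(s, a)
            ret += discount * reward
            discount *= gamma
            if done:
                break
            a = policy(s)  # after the first forced action, follow the policy
        total += ret
    return total / n_rollouts

# Hypothetical toy chain: states 0..3, action 1 moves right, action 0 stays,
# every step costs -1, and state 3 is terminal.
def chain_step(s, a):
    s_next = min(s + a, 3)
    return s_next, -1.0, s_next == 3

def always_right(s):
    return 1

q_right = rollout(chain_step, always_right, state=0, action=1, horizon=10, gamma=1.0, n_rollouts=5)
q_stay = rollout(chain_step, always_right, state=0, action=0, horizon=10, gamma=1.0, n_rollouts=5)
```

On this deterministic chain, moving right immediately yields a higher estimate than first staying in place, so the estimated best action in state 0 is action 1.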
Up to line 12 of Algorithm 1, ERCPI is in fact algorithmically identical to RCPI, with the slight distinction that only the best $(s, a)$ tuples are kept, as is usual when using RCPI with a multiclass classifier.
ERCPI’s main difference appears starting at line 13; it is here that the original training set $S_T$ is mapped onto the $C$ binary action spaces, and that each individual sub-policy is learned. Line 16 replaces the original label of state $s$ by its binary label in $\pi_i$’s action space; this corresponds to bit $b_i$ of action $a$’s code.
The Train function on line 19 corresponds to the training of $\pi_i$’s corresponding binary classifier on $S_T^i$. After this step, the global policy is defined according to Eq. (2). Note that, to ensure the stability of the algorithm, the new policy obtained after one iteration of the algorithm is an alpha-mixture policy between the old policy and the new one obtained by the classifier (cf. line 23).
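The relabelling performed in the learning step can be sketched as follows. This is an assumed illustration rather than Algorithm 1 itself: the function name, the toy coding matrix, and the 0/1 encoding of the bits are hypothetical, and the training of the binary classifiers is left out.

```python
import numpy as np

def binary_training_sets(states, best_actions, coding_matrix):
    """Project a multiclass training set onto each bit of the action codes.

    Returns one (states, binary_labels) pair per column of the coding matrix.
    """
    states = np.asarray(states)
    best_actions = np.asarray(best_actions)
    sets = []
    for bit in range(coding_matrix.shape[1]):
        # The label of each state becomes the chosen bit of its best action's code.
        labels = coding_matrix[best_actions, bit]
        sets.append((states, labels))
    return sets

# Hypothetical 4-action problem encoded on 3 bits.
coding_matrix = np.array([[0, 0, 1],
                          [0, 1, 0],
                          [1, 0, 0],
                          [1, 1, 1]])
states = np.array([[0.1, 0.2], [0.5, 0.4], [0.9, 0.3]])
best_actions = np.array([3, 0, 2])  # estimated best action for each sampled state
per_bit = binary_training_sets(states, best_actions, coding_matrix)
```

Each entry of `per_bit` can then be handed to any binary classifier; the states are shared across all bits, and only the labels change.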
3.3 Binarized RCPI
ERCPI splits the policy improvement problem into $C$ individual problems, but training still needs the global policy $\pi$, thus requiring the full set of binary policies. Additionally, for each state, all $A$ actions have to be evaluated by Monte Carlo simulation (Alg. 1, line 5). To reduce the complexity of this algorithm, we propose learning the $C$ binary sub-policies $\pi_i$ independently, transforming our initial MDP into $C$ sub-MDPs, each one corresponding to the environment in which a particular $\pi_i$ is acting.
Each binary policy $\pi_i$ deals with its own particular representation of the action space, defined by its corresponding column in the coding matrix $M^c$. For training, best-action selections must be mapped into this binary space, and each of the $\pi_i$’s choices must be combined to be applied back in the original state space.
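Let $\mathcal{A}_i^+, \mathcal{A}_i^- \subset \mathcal{A}$ be the action sets associated to $\pi_i$ such that:
$$\mathcal{A}_i^+ = \{a \in \mathcal{A} \mid M^c_{[a,i]} = +\}, \qquad \mathcal{A}_i^- = \{a \in \mathcal{A} \mid M^c_{[a,i]} = -\} = \mathcal{A} \setminus \mathcal{A}_i^+ \qquad (3)$$
where $M^c_{[a,i]}$ is bit $i$ of action $a$’s code in the coding matrix. For a particular $i$, $\mathcal{A}_i^+$ is the set of original actions corresponding to sub-action $+$, and $\mathcal{A}_i^-$ is the set of original actions corresponding to sub-action $-$.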
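We can now define $C$ new binary MDPs that we name sub-MDPs, and denote $\mathcal{M}_i$. They are defined from the original MDP as follows:

$\mathcal{S}_i = \mathcal{S}$, the same state set as the original MDP.

$\mathcal{A}_i = \{+, -\}$.

$T_i(s', s, a_i) = \sum_{a \in \mathcal{A}} P(a \mid a_i)\, P(s' \mid s, a)$, where $P(a \mid a_i)$ is the probability of choosing action $a \in \mathcal{A}$, knowing that the sub-action applied on the sub-MDP is $a_i \in \{+, -\}$. We consider $P(a \mid a_i = +)$ to be uniform for $a \in \mathcal{A}_i^+$ and null for $a \in \mathcal{A}_i^-$, and vice versa. $P(s' \mid s, a)$ is the original MDP’s transition probability.

$R_i(s, a_i) = \sum_{a \in \mathcal{A}} P(a \mid a_i)\, R(s, a)$.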
Each of these new MDPs represents the environment in which a particular binary policy operates. Each of these MDPs is defined independently from one another, and therefore we can consider each of these MDPs to be a separate RL problem for its corresponding binary policy.
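One transition of such a sub-MDP can be sketched as below, under stated assumptions: the sub-actions $+$ and $-$ are encoded as 1 and 0, the `step` function of the original MDP is a hypothetical callable returning a next state and a reward, and the coding matrix is a toy example.

```python
import random

def action_sets(coding_matrix, bit):
    """Split the original actions according to one bit of their code."""
    plus = [a for a, code in enumerate(coding_matrix) if code[bit] == 1]
    minus = [a for a, code in enumerate(coding_matrix) if code[bit] == 0]
    return plus, minus

def sub_mdp_step(step, coding_matrix, bit, state, sub_action, rng=random):
    """Apply a binary sub-action by sampling uniformly among the original actions it covers."""
    plus, minus = action_sets(coding_matrix, bit)
    candidates = plus if sub_action == 1 else minus
    if not candidates:
        raise ValueError("this bit does not split the action set")
    action = rng.choice(candidates)
    return step(state, action)

# Hypothetical 4-action problem encoded on 3 bits.
coding_matrix = [[0, 0, 1],
                 [0, 1, 0],
                 [1, 0, 0],
                 [1, 1, 1]]
plus_0, minus_0 = action_sets(coding_matrix, bit=0)

# Toy original MDP whose next state simply echoes the action applied.
def echo_step(state, action):
    return action, -1.0

next_state, reward = sub_mdp_step(echo_step, coding_matrix, bit=0, state=0,
                                  sub_action=1, rng=random.Random(0))
```

Because the original action is drawn uniformly inside the chosen set, repeated calls average the original dynamics over that set, which is the behaviour the sub-MDP definition describes.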
In light of this, we propose to transform RCPI’s training process for the base MDP into $C$ new training processes, each one trying to find an optimal $\pi_i$ for its corresponding sub-MDP $\mathcal{M}_i$. Once all of these binary policies have been trained, they can be used during inference in the manner described in Section 3.2.
The main advantage of this approach is that, since each of the sub-problems in Algorithm 2 is modeled as a binary-actioned MDP, increasing the number of actions in the original problem simply increases the number of sub-problems logarithmically, without increasing the complexity of these sub-problems (see Section 3.4).
Let us now discuss some details of BRCPI, as described in Algorithm 2. BRCPI resembles RCPI very strongly, except that instead of looping over the $A$ actions on line 6, BRCPI only samples the two sub-actions, $+$ and $-$. However, the inner loop is run $C$ times, as can be seen on line 1 of Algorithm 2.
Within the Rollout function (line 7), if $\pi_i$ chooses sub-action ‘$+$’, an action $a$ from the original MDP is sampled from $\mathcal{A}_i^+$ following $P(a \mid a_i)$, and the MDP’s transition function is called using this action. This effectively estimates the expected reward of choosing sub-action $a_i$ in state $s$.
As we saw in Section 3.2, each $\mathcal{A}_i$ is a different binary projection of the original action set. Each of the $\pi_i$ classifiers is thus making a decision considering a different split of the action space. Some splits may make no particular sense w.r.t. the MDP at hand, and therefore the expected returns of that particular $\pi_i$’s $\mathcal{A}_i^+$ and $\mathcal{A}_i^-$ may be equal. This does not pose a problem, as that particular sub-policy will simply output noise, which will be corrected for by more pertinent splits given to the other sub-policies.
3.4 Computational Cost and Complexity
We study the computational cost of the proposed algorithms in comparison with the RCPI approach and present their respective complexities.
In order to define this cost, let us consider that $C(S, A)$ is the time spent learning a multiclass classifier on $S$ examples with $A$ possible outputs, and $I(A)$ is the cost of classifying one input.
The computational cost of one iteration of RCPI or ERCPI is composed of both a simulation cost, which corresponds to the time spent running Monte Carlo simulations using the current policy, and a learning cost, which corresponds to the time spent learning the classifier that will define the next policy. (Footnote 4: In practice, when there are many actions, simulation cost is significantly higher than learning cost, which is thus ignored Lazaric et al. (2010).) This cost takes the following general form:
$$Cost = SAK \times T\, I(A) + C(S, A) \qquad (4)$$
where $T\, I(A)$ is the cost of sampling one trajectory of size $T$ (a trajectory length here, not the transition function), $SAK \times T\, I(A)$ is the cost of executing the $K$ Monte Carlo simulations over $S$ states testing $A$ possible actions, and $C(S, A)$ is the cost of learning the corresponding classifier. (Footnote 5: We do not consider the computational cost of transitions in the MDP.)
The main difference between RCPI and ERCPI comes from the values of $I(A)$ and $C(S, A)$. When comparing ERCPI with an RCPI algorithm using a one-vs-all (RCPI-OVA) multiclass classifier, in which one binary classifier is learned for each possible action, it is easy to see that our method reduces both $I(A)$ and $C(S, A)$ by a factor of $A / \log(A)$ (cf. Table 1).
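Table 1: Simulation and learning cost of each algorithm.

| Algorithm | Simulation Cost | Learning Cost |
|---|---|---|
| RCPI-OVA | | |
| ERCPI | | |
| BRCPI | | |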
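
Table 2: Complexity of each method w.r.t. the number of actions $A$.

| Method | RCPI-OVA | ERCPI | BRCPI |
|---|---|---|---|
| Complexity | $O(A^2)$ | $O(A \log A)$ | $O(\log A)$ |
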
When considering the BRCPI algorithm, the inference and learning costs are reduced as in ERCPI. However, the simulation cost is reduced as well, as our method proposes to learn a set of optimal binary policies on binary sub-MDPs. For each of these sub-problems, the simulation cost no longer grows with the size of the original action set, since the number of possible actions is only 2. The learning cost corresponds to learning only binary classifiers, resulting in a very low cost (cf. Table 1). The overall resulting complexity w.r.t. the number of actions is presented in Table 2, showing that the complexity of BRCPI is only logarithmic. In addition, it is important to note that each of the BRCPI sub-problems is atomic, and that they are therefore easily parallelized. To illustrate these complexities, computation times are reported in the experimental section.
4 Experiments
In this paper, our concern is the ability to deal with a large number of uncorrelated actions in practice. Hence, the best demonstration of this ability is an experimental assessment of ERCPI and BRCPI. In this section, we show that BRCPI exhibits substantial speedups, turning days of computation into hours or less.
4.1 Protocol
We evaluate our approaches on two baseline RL problems: Mountain Car and Maze.
The first problem, Mountain Car, is well-known in the RL community. Its definition varies, but it is usually based on a discrete and small set of actions (2 or 3). However, the actions may be defined over a continuous domain, which is more “realistic”. In our experiment, we discretize the range of accelerations to obtain a discrete set of actions. Discretization ranges from coarse to fine in the experiments, thus allowing us to study the effect of the size of the action set on the performance of our algorithms. The continuous state space is handled by way of tiling Sutton (1996). The reward at each step is -1, and each episode has a maximum length of 100 steps. The overall reward thus measures the ability of the obtained policy to push the car up the mountain quickly.
The second problem, Maze, is a 50x50 grid-world problem in which the learner has to go from the left side to the right side of a grid. Each cell of the grid corresponds to a particular negative reward, either , , or . For the simplest case, the agent can choose either to move up, down, or right, resulting in a 3-action MDP. We construct more complex action sets by generating all sequences of actions of a defined length, i.e. for length 2, the 6 possible actions are up-up, up-right, down-up, etc. Contrary to Mountain Car, there is no notion of similarity between actions in this maze problem w.r.t. their consequences. Each state is represented by a vector of features that contains the information about the different types of cells that are contained in a 5x5 grid around the agent. The overall reward obtained by the agent corresponds to its ability to go from the left to the right, avoiding cells with high negative rewards.
In both problems, training and testing states are sampled uniformly in the space of the possible states. We have chosen to sample states for each problem, the number of trajectories made for each state-action pair is . The binary base learner is a hinge-loss perceptron learned with 1000 iterations by a stochastic gradient-descent algorithm. The error-correcting codes have been generated using a classical random procedure as in Berger (1999). The value of the alpha-mixture policy is .
4.2 Results
The average rewards obtained after convergence of the three algorithms are presented in Figures 3 and 2 with a varying number of actions. The average reward of a random policy is also illustrated. First of all, one can see that RCPI-OVA and ERCPI perform similarly on both problems except for Maze with 243 actions. This can be explained by the fact that OVA strategies are not able to deal with problems with many classes when they involve solving binary classification problems with few positive examples. In this setting, ECOC classifiers are known to perform better. BRCPI achieves lower performance than RCPI-OVA and ERCPI. Indeed, BRCPI learns optimal independent binary policies that, when used together, only correspond to a sub-optimal overall policy. Note that even with a large number of actions, BRCPI is able to learn a relevant policy; in particular, Maze with 719 actions shows BRCPI is clearly better than the random baseline, while the other methods are simply intractable. This is a very interesting result since it implies that BRCPI is able to find non-trivial policies when classical approaches are intractable.
Table 3 provides the computation times for one iteration of the different algorithms for Mountain Car with 100 actions. ERCPI speeds up RCPI by a factor of 1.4, while BRCPI is 12.5 times faster than RCPI, and 23.5 times faster when considering only the simulation cost. This explains why Figure 3 does not show the performance obtained by RCPI and ERCPI on the maze problem with 719 actions: in that setting, one iteration of these algorithms takes days, while only requiring a few hours with BRCPI. Note that these speedup values increase with the number of actions.
Finally, Figure 4 gives the performance of BRCPI depending on the number of rollouts, and shows that a better policy can be found by increasing the value of $K$. Note that, even if we use a large value of $K$, BRCPI’s running time remains low w.r.t. RCPI-OVA and ERCPI.
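Table 3: Computation times for one iteration (Mountain Car, 100 actions, 46 bits).

| | Sim. | Learning | Total | Speedup |
|---|---|---|---|---|
| OVA | 4,312 | 380 | 4,698 | ×1.0 |
| ERCPI | 3,188 | 190 | 3,378 | ×1.4 |
| BRCPI | 184 | 190 | 374 | ×12.5 |
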
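As a quick check, the speedup factors quoted in the text can be recomputed from the totals and simulation times reported for Mountain Car with 100 actions. The ratios below use only those reported figures and agree with the quoted factors up to rounding.

```latex
\begin{align*}
\text{ERCPI vs. OVA (total):} \quad & \frac{4698}{3378} \approx 1.39 \\
\text{BRCPI vs. OVA (total):} \quad & \frac{4698}{374} \approx 12.56 \\
\text{BRCPI vs. OVA (simulation only):} \quad & \frac{4312}{184} \approx 23.43
\end{align*}
```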
5 Related Work
Rollout Classification Policy Iteration Lagoudakis & Parr (2003) provides an algorithm for RL in MDPs that have a very large state space. RCPI’s Monte-Carlo sampling phase can be very costly, and a couple of approaches have been provided to better sample the state space Dimitrakakis & Lagoudakis (2008), thus leading to speedups when using RCPI. Recently, the effectiveness of RCPI has been theoretically assessed Lazaric et al. (2010). The well-known efficiency of this method for real-world problems and its inability to deal with many actions have motivated this work. Reinforcement Learning has long been able to scale to state spaces with many (even infinitely many) states by generalizing the value function over the state space Tham (1994); Tesauro (1992). Tesauro first introduced rollouts Tesauro & Galperin (1997), leveraging Monte-Carlo sampling for exploring a large state and action space. Dealing with large action spaces has additionally been considered through sampling or gradient descent on $Q$ Negoescu et al. (2011); Lazaric et al. (2007), but these approaches assume a well-behaved $Q$-function, which is hardly guaranteed.
There is one vein of work reducing action-space lookups logarithmically by imposing some form of binary search over the action space Pazis & Lagoudakis (2011); Pazis & Parr (2011). These approaches augment the MDP with a structured search over the action space, thus placing the action space’s complexity in the state space. Although not inspirational to ERCPI, these approaches are similar in their philosophy. However, neither proposes a solution to speeding up the learning phase as BRCPI does, nor do they eschew value functions by relying solely on classifier-based approaches as ERCPI does. Error-Correcting Output Codes were first introduced by Dietterich and Bakiri (1995) for use in the case of multiclass classification. Although not touched upon in this article, coding dictionary construction can be a key element of an ECOC-based classifier’s abilities Beygelzimer et al. (2008). Although in our case we rely on randomly generated codes, codes can be learned from the actual training data Crammer & Singer (2002), or from an a priori metric over the class space or a hierarchy Cissé et al. (2011).
6 Conclusion
We have proposed two new algorithms which aim at obtaining a good policy while learning faster than the standard RCPI algorithm. ERCPI is based on the use of Error-Correcting Output Codes with RCPI, while BRCPI consists in decomposing the original MDP into a set of binary MDPs which can be learned separately at a very low cost. While ERCPI obtains equivalent or better performance than the classical one-vs-all RCPI implementations at a lower computational cost, BRCPI allows one to obtain a sub-optimal policy very fast, even if the number of actions is very large. The complexities of the proposed solutions are $O(A \log A)$ and $O(\log A)$ respectively, in comparison to RCPI’s complexity of $O(A^2)$. Note that one can use BRCPI to discover a good policy, and then ERCPI in order to improve this policy; this practical solution is not studied in this paper.
This work opens many new research perspectives: first, as the performance of BRCPI directly depends on the quality of the codes generated for learning, it would be very interesting to design automatic methods able to find well-adapted codes, particularly when one has a metric over the set of possible actions. From a theoretical point of view, we plan to study the relation between the performance of the sub-policies $\pi_i$ in BRCPI and the performance of the final obtained policy $\pi$. Finally, the fact that our method allows one to deal with problems with thousands of discrete actions also opens many applied perspectives, and can allow us to find good solutions for problems that have never been studied before because of their complexity.
References
Bentley, Jon Louis. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
Berger, A. Error-correcting output coding for text classification. In Workshop on Machine Learning for Information Filtering, IJCAI ’99, 1999.
Beygelzimer, A., Langford, J., and Zadrozny, B. Machine learning techniques: reductions between prediction quality metrics. Performance Modeling and Engineering, pp. 3–28, 2008.
Bubeck, S., Munos, R., Stoltz, G., Szepesvári, C., et al. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011.
Cissé, M., Artieres, T., and Gallinari, Patrick. Learning efficient error correcting output codes for large hierarchical multi-class problems. In Workshop on Large-Scale Hierarchical Classification, ECML/PKDD ’11, pp. 37–49, 2011.
Crammer, Koby and Singer, Yoram. On the Learnability and Design of Output Codes for Multiclass Problems. Machine Learning, 47(2):201–233, 2002.
Dietterich, T.G. and Bakiri, G. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.
Dimitrakakis, Christos and Lagoudakis, Michail G. Rollout sampling approximate policy iteration. Machine Learning, 72(3):157–171, July 2008.
Lagoudakis, Michail G. and Parr, Ronald. Reinforcement learning as classification: Leveraging modern classifiers. In Proc. of ICML ’03, 2003.
Lazaric, Alessandro, Restelli, Marcello, and Bonarini, Andrea. Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods. In Proc. of NIPS ’07, 2007.
Lazaric, Alessandro, Ghavamzadeh, Mohammad, and Munos, Rémi. Analysis of a classification-based policy iteration algorithm. In Proc. of ICML ’10, pp. 607–614, 2010.
Negoescu, D.M., Frazier, P.I., and Powell, W.B. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS J. on Computing, 23(3):346–363, 2011.
Pazis, Jason and Lagoudakis, Michail G. Reinforcement Learning in Multidimensional Continuous Action Spaces. In Proc. of Adaptive Dynamic Programming and Reinf. Learn., pp. 97–104, 2011.
Pazis, Jason and Parr, Ronald. Generalized Value Functions for Large Action Sets. In Proc. of ICML ’11, pp. 1185–1192, 2011.
Sutton, R.S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Proc. of NIPS ’96, pp. 1038–1044, 1996.
Tesauro, Gerald. Practical issues in temporal difference learning. Machine Learning, 8:257–277, 1992.
Tesauro, Gerald and Galperin, Gregory R. On-Line Policy Improvement Using Monte-Carlo Search. In Proc. of NIPS ’97, pp. 1068–1074, 1997.
Tham, C.K. Modular on-line function approximation for scaling up reinforcement learning. PhD thesis, University of Cambridge, 1994.