Conventional tabular reinforcement learning is bottlenecked by the curse of dimensionality in practical applications. The number of parameters that need to be trained grows exponentially with the number of state and action variables. To make reinforcement learning practically tractable, one line of research is hierarchical reinforcement learning (HRL), which develops principled ways of temporal and state abstraction to reduce the dimensionality of sequential decision making.
The basic idea of temporal abstraction is to develop macro-actions that take several steps to terminate before returning. Good macro-actions usually solve sub-goals, so that multiple macro-actions divide a difficult task into several simpler ones. State abstraction, in turn, reduces dimensionality by removing state variables that are irrelevant for decision making, which shrinks the state space and helps against over-fitting. These two techniques lead to a natural hierarchical control architecture, which intuitively resembles how humans solve complex tasks.
Another area of research closely related to our work is batch reinforcement learning, which aims to learn the best policy from a fixed set of previously collected samples. Compared to on-policy algorithms, batch reinforcement learning enjoys stability and data efficiency. More importantly, it makes reinforcement learning applicable to practical problems in which collecting new samples is expensive, such as education, spoken dialog systems and medical systems. Well-known batch reinforcement learning algorithms include Least-Squares Policy Iteration (LSPI), Fitted Q Iteration (FQI) and Neural Fitted Q Iteration (NFQ).
In this paper, we combine batch learning and hierarchical reinforcement learning in order to achieve faster learning, data efficiency and the ability to compare different hierarchical models on the same data.
2 Related Work
There are three major approaches, developed relatively independently, that aim to formalize the idea of abstraction in reinforcement learning: 1) the option framework, 2) Hierarchies of Abstract Machines (HAMs) and 3) the MAXQ framework.
Under the option framework, developers augment the original action set with options, which are macro-actions that have their own predefined policy, termination states and active states. Sutton et al. have shown that such a system is a semi-Markov Decision Process (SMDP), which converges to a unique hierarchical optimal solution using a modified Q-learning algorithm. In the HAM framework, rather than giving the entire policy of these macro-actions, developers only need to specify a partial program that defines part of the policy. Using HAMQ learning, HAM can also converge to a hierarchical optimal solution.
Finally, the MAXQ framework provides an elegant formulation that decomposes the original MDP into several subroutines in a hierarchy, and the algorithm can learn policies recursively for all the subroutines. Therefore, in the MAXQ framework, there is no need to specify the policy of any macro-action. However, Dietterich shows that it can only achieve a recursive optimal solution, which in the extreme case can be arbitrarily worse than the hierarchical optimal solution.
All of the above work assumes that the agent can interact with the world while learning. However, in real-world applications that need HRL, it is usually very expensive to collect data, and catastrophic failures are not allowed during operation. This rules out online learning algorithms that could perform horribly in the early learning stage. To the best of our knowledge, there is little prior work on batch learning algorithms that allow a hierarchical SMDP to be trained from an existing dataset collected under a stochastic behavior policy. We believe that such algorithms are valuable for applying HRL in complex practical domains.
3 Batch Learning for HSMDP
Mostly, we follow the definitions in the MAXQ framework. However, for notational simplicity, we also borrow some notation from the option framework.
3.2 Markov Decision Process
An MDP $M$ is described by the tuple $(S, A, P, R)$, where:
- $S$ is the state space of $M$,
- $A$ is a set of primitive actions that are available,
- $P(s' \mid s, a)$ defines the transition probability of executing primitive action $a$ in state $s$,
- $R$ is the reward function defined over $S$ and $A$.
3.3 Hierarchical Decomposition
An MDP $M$ can be decomposed into a finite set of subtasks $\{M_0, M_1, \dots, M_n\}$, with the convention that $M_0$ is the root subtask, i.e. solving $M_0$ solves the entire original MDP $M$. Each $M_i$ is then a Semi-Markov Decision Process (SMDP) that shares the same $S$, $P$, $R$ with $M$ and has an extra tuple $(T_i, A_i)$, where:
- $T_i$ is the termination predicate of subtask $M_i$, which partitions $S$ into a set of active states $S_i$ and a set of terminal states $T_i$. If $M_i$ enters a state in $T_i$, $M_i$ and its subtasks exit immediately, i.e. $\beta_i(s) = 1$ if $s \in T_i$, otherwise $\beta_i(s) = 0$.
- $A_i$ is a nonempty set of actions that can be performed by $M_i$. The actions can be either primitive actions from $A$ or other subtasks $M_j$, where $i \neq j$. We will refer to $A_i$ as the children of subtask $M_i$.
It is evident that a valid hierarchical decomposition forms a directed acyclic graph (DAG) in which each non-terminal node corresponds to a subtask and each terminal node corresponds to a primitive action. In the later discussion, we use hierarchical decomposition and DAG interchangeably.
3.4 Hierarchical Policy
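A hierarchical policy, $\pi$, is a set of policies, one for each subtask: $\pi = \{\pi_0, \pi_1, \dots, \pi_n\}$. In the terminology of the option framework, a subtask policy $\pi_i$ is a deterministic Markov option, with $\beta_i(s) = 1$ for $s \in T_i$, and $\beta_i(s) = 0$ otherwise.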
3.5 Recursive Optimality
A recursive optimal policy for MDP $M$ with hierarchical decomposition $\{M_0, M_1, \dots, M_n\}$ is a hierarchical policy $\pi = \{\pi_0, \pi_1, \dots, \pi_n\}$ such that, for each subtask $M_i$, the corresponding policy $\pi_i$ is optimal for the SMDP defined by the set of states $S_i$, the set of actions $A_i$, the state transition probability $P(s', N \mid s, a)$, and the reward function $R$.
The problem formulation is as follows: given any finite set of samples and any valid hierarchical decomposition of the original MDP $M$, we wish to learn the recursive optimal hierarchical policy $\pi_r^*$.
We now propose Hierarchical Q-value Iteration (HQI), and we prove that it converges to the recursive optimal solution for any hierarchical decomposition, given that the batch sample distribution has sufficient state-action exploration. The basic idea is to train every subtask using Subtask Q-value Iteration (SQI) in a bottom-up fashion. The training prerequisite of SQI for a specific subtask is that all of its children have converged to their greedy optimal policies. To fulfil this constraint, HQI first topologically sorts the DAG and starts running SQI from the subtasks whose children are all primitive actions. After those subtasks converge to their optimal policies, the algorithm continues with other subtasks whose children are either converged subtasks or primitive actions. We will show that, for a valid DAG, there always exists an ordering of the subtasks that fulfils the prerequisite of SQI.
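The bottom-up ordering described above can be sketched with a topological sort. This is a minimal illustration, not the authors' implementation: the `CHILDREN` hierarchy below is a hypothetical Taxi decomposition (only the subtask names `get`, `put`, `navi_get` and `navi_put` appear in the paper; the `root` node, the primitive action names and the edges are assumptions), and it relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical Taxi hierarchy: each subtask maps to its children.
# Any child that is not itself a key of CHILDREN is a primitive action.
CHILDREN = {
    "root": ["get", "put"],
    "get": ["navi_get", "pickup"],
    "put": ["navi_put", "putdown"],
    "navi_get": ["north", "south", "east", "west"],
    "navi_put": ["north", "south", "east", "west"],
}


def training_order(children):
    """Return the subtasks ordered so that every child subtask precedes its parents."""
    # TopologicalSorter expects a mapping node -> predecessors. The
    # predecessors of a subtask are its child subtasks; primitive actions
    # need no training, so they are left out of the graph.
    graph = {
        task: [child for child in kids if child in children]
        for task, kids in children.items()
    }
    # static_order() raises graphlib.CycleError if the hierarchy is not a DAG.
    return list(TopologicalSorter(graph).static_order())


order = training_order(CHILDREN)
```

Running SQI on the subtasks in the order returned by `training_order` ensures that each subtask is trained only after all of its child subtasks.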
One challenge of training a subtask whose children include other subtasks is that we cannot use the optimal SMDP Bellman equation described in the MAXQ framework, Equation (1), for $Q_i(s, a)$, the Q-value function of subtask $M_i$ at state $s$ and action $a$.
The main problem with Equation (1) is that, in order to estimate the Q-value of a child subtask $M_j$, the parent needs an estimate of the transition probability $P(s', N \mid s, M_j)$, which is the joint distribution of $M_j$'s exit state $s'$ and the number of primitive steps $N$ needed to reach its termination. Although the termination states of the child are given by $T_j$, it is difficult to estimate the joint distribution of $N$ and $s'$ without collecting new samples if $M_j$ follows a policy that differs from the behavior policy. Because the behavior policy is usually random and performs poorly, the collected samples provide no information about how many steps the subtask would take to terminate under a different (optimal) policy.
Therefore, instead of using the above Bellman equation, which updates the Q table of the parent when a child exits, we use the intra-option Bellman equation proposed in the option framework:
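As a point of reference, the intra-option Bellman equation of Sutton et al. can be written for a subtask and its children roughly as follows. This is a sketch in assumed notation rather than the authors' exact equation: $Q_i(s, j)$ is the Q-value of invoking child $j$ of subtask $i$ in state $s$, $A_i$ is the set of children of subtask $i$, $\pi_j(s, a)$ is the probability that child $j$ selects primitive action $a$ in state $s$, $\beta_j(s')$ is 1 if child $j$ terminates in $s'$ and 0 otherwise, and $R(s, a)$ is the expected one-step reward.

```latex
\[
\begin{aligned}
Q_i(s, j) &= \sum_{a \in A} \pi_j(s, a) \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U_i(s', j) \Big], \\
U_i(s', j) &= \big(1 - \beta_j(s')\big)\, Q_i(s', j) + \beta_j(s') \max_{j' \in A_i} Q_i(s', j').
\end{aligned}
\]
```

For a primitive child, $\pi_j$ puts all of its mass on that action and $\beta_j(s') = 1$ everywhere, so the backup reduces to the flat Bellman optimality equation.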
Equation (3) also yields a contraction in the max norm and can update the Q table after observing every new reward, which eliminates the need to estimate $P(s', N \mid s, M_j)$. Another key benefit is that flat samples suffice to estimate the one-step transition probability and reward in Equation (3), which makes the data independent of the hierarchical decomposition, so that optimal policies for different structures can be learned from the same dataset. Specifically, we can estimate these two terms by $\hat{P}(s' \mid s, a) = N(s, a, s') / N(s, a)$ and by $\hat{R}(s, a)$, the average reward observed over the $N(s, a)$ matching experiences, where $N(s, a, s')$ and $N(s, a)$ are the numbers of experiences that contain $(s, a, s')$ and $(s, a)$, respectively. Finally, since we assume that converged subtasks follow a deterministic greedy policy, $\pi_j(s, a) = 1$ if $a$ is the greedy primitive action that subtask $M_j$ would take at state $s$, and $\pi_j(s, a) = 0$ otherwise. This step is in fact crucial for HQI to learn the optimal policy, because it allows a subtask to discard the samples that do not follow the optimal behavior of its children.
The HQI algorithm is summarized in Algorithm 1 and SQI is summarized in Algorithm 2. Any dataset can be used at every iteration of SQI. If the initial data covers the important parts of the state-action space, the same dataset can be used to train all subtasks of the DAG.
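To make the procedure concrete, the following is a minimal tabular sketch of one SQI run, assuming a count-based model and an intra-option style backup. It is an illustration under stated assumptions, not a reproduction of Algorithm 2: the function names (`estimate_model`, `sqi`), the callbacks `greedy_action`, `child_done` and `is_terminal`, the zero value assigned to terminal states, and the toy chain domain are all hypothetical.

```python
from collections import defaultdict


def estimate_model(samples):
    """Count-based estimates of P(s'|s,a) and R(s,a) from flat (s, a, r, s') samples."""
    n_sa = defaultdict(int)      # N(s, a)
    n_sas = defaultdict(int)     # N(s, a, s')
    r_sum = defaultdict(float)   # sum of rewards observed after (s, a)
    for s, a, r, s2 in samples:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s2)] += 1
        r_sum[(s, a)] += r
    P = defaultdict(dict)
    for (s, a, s2), n in n_sas.items():
        P[(s, a)][s2] = n / n_sa[(s, a)]
    R = {sa: r_sum[sa] / n for sa, n in n_sa.items()}
    return P, R


def sqi(states, children, greedy_action, child_done, is_terminal, P, R,
        gamma=0.9, iters=100):
    """Tabular Subtask Q-value Iteration for a single subtask.

    children:            child names (primitive actions or converged subtasks)
    greedy_action(c, s): primitive action that child c takes in state s
    child_done(c, s):    True if child c terminates in s (always True for primitives)
    is_terminal(s):      termination predicate of the subtask being trained
    """
    Q = {(s, c): 0.0 for s in states for c in children}
    for _ in range(iters):
        new_Q = dict(Q)
        for s in states:
            if is_terminal(s):
                continue  # assumption: terminal states keep value 0
            for c in children:
                a = greedy_action(c, s)
                if (s, a) not in R:
                    continue  # no data for (s, a): keep the old value (simplification)
                expected_u = 0.0
                for s2, p in P[(s, a)].items():
                    if is_terminal(s2):
                        u = 0.0
                    elif child_done(c, s2):
                        u = max(Q[(s2, c2)] for c2 in children)  # child exited: re-choose
                    else:
                        u = Q[(s2, c)]                           # child continues
                    expected_u += p * u
                new_Q[(s, c)] = R[(s, a)] + gamma * expected_u
        Q = new_Q
    return Q


# Toy chain 0-1-2-3 with goal state 3 and a reward of -1 per step.
def step(s, a):
    return min(s + 1, 3) if a == "right" else max(s - 1, 0)


samples = [(s, a, -1.0, step(s, a)) for s in range(3) for a in ("left", "right")]
P, R = estimate_model(samples)
Q = sqi(
    states=range(4),
    children=["left", "go_right"],  # one primitive action, one converged subtask
    greedy_action=lambda c, s: "right" if c == "go_right" else c,
    child_done=lambda c, s: c != "go_right" or s == 3,
    is_terminal=lambda s: s == 3,
    P=P, R=R,
)
```

Only the transitions that match a child's greedy primitive action enter the backup, which is how samples that deviate from the children's converged behavior are effectively ignored.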
4.1 Extension to Function Approximation and State Abstraction
We note that it is straightforward to extend SQI to Fitted-SQI, which uses a function approximator to model the Q-value function of a subtask at the end of each iteration. The direct advantage of function approximation is that it can incorporate powerful supervised regression methods, such as Gaussian Processes or Neural Networks, to scale up to large-scale and continuous MDPs. Although function approximation usually forfeits the theoretical convergence guarantee of the tabular case, our experiments show that Fitted-HQI is able to converge to the unique optimal solution. Fitted-SQI is summarized in Algorithm 3.
Furthermore, state abstraction here means finding a subset of state variables that are most informative for each subtask. A good hierarchical decomposition decomposes the original MDP into several simpler ones, such that the agent only needs to care about a small set of features in each subtask. Therefore, a good structure should create easy opportunities for state abstraction at each level of its hierarchy. Many techniques have been explored in non-hierarchical batch reinforcement learning to achieve state abstraction. These methods can be applied directly in the fitting step of Fitted-SQI, so that each subtask can learn its own sparse state representation. Due to the space limit, we conduct simple manual state abstraction for each subtask in this paper, and leave the study of automatic feature selection techniques to future work.
5 Proof of Convergence
In this section, we prove that HQI (in the tabular case) converges to the recursive optimal policy. Assume that the policy at each subtask is ordered, such that it breaks ties deterministically (e.g. favoring left to right); this defines a unique recursive optimal hierarchical policy $\pi_r^*$ and a corresponding recursive optimal Q function $Q_r^*$. We then show that HQI converges to $\pi_r^*$ and $Q_r^*$. The subscript $r$ refers to recursive optimality.
We want to prove that, for an MDP $M$ with hierarchical decomposition $\{M_0, M_1, \dots, M_n\}$, HQI converges to the recursive optimal hierarchical policy of $M$, $\pi_r^*$.
We first prove that, for a subtask $M_i$ whose children have all converged to their recursive optimal policies, and given an infinite amount of batch data, SQI converges to the optimal Q-value function after an infinite number of iterations.
We then show that HQI provides an order of training all the subtasks in the DAG such that, when a subtask $M_i$ is trained, all of its children have already converged to their recursive optimal policies.
5.2 Proof Sketch
Step 1 We begin with the base case of a subtask whose children are all primitive actions. Equation (3) then falls back to the traditional Bellman operator for a flat MDP, because a primitive action always terminates after one step:
$$Q_i(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a' \in A_i} Q_i(s', a'),$$
where $A_i$ denotes the children of the subtask.
Therefore, for a subtask with only primitive children, SQI is equivalent to flat Q-value iteration, which is guaranteed to converge to the optimal policy given sufficient data.
Then, for a subtask $M_i$ that has other subtasks as children, by definition, when we run SQI the children of $M_i$ have converged to their unique deterministic recursive optimal policies. This means that every action $a \in A_i$ is a deterministic Markov option as defined in the option framework. Sutton et al. proved that "for any set of deterministic Markov options one step intra-option Q-learning converges w.p. 1 to the optimal Q-values, for every option regardless of what options are executed during learning provided every primitive action gets executed in every state infinitely often". Refer to Sutton et al. for the detailed proof.
Step 2 By definition, a hierarchical decomposition is a Directed Acyclic Graph (DAG) with edges from parents to their children. In this proof, we first reverse the edges so that they point from children to their parents. We know from graph theory that any Directed Acyclic Graph has at least one topological sort, such that for every edge $(u, v)$, $u$ comes before $v$ in the ordering. Therefore, we can topologically sort the hierarchical decomposition with reversed edges, so that SQI always trains the children before their parents.
Also, the definition of topological sort ensures the initial condition that there is at least one subtask that only has primitive children. Therefore, we can then conclude that for any DAG, HQI can traverse the subtasks such that the conditions of SQI convergence are met. Then, HQI converges for all subtasks.
6.1 Experimental Setup
We applied our algorithm to the Taxi domain described by Dietterich. This is a simple grid world that contains a taxi, a passenger, and four specially designated locations labeled R, G, B, and Y. In the starting state, the taxi is in a randomly chosen cell of the grid, and the passenger is at one of the four special locations. The passenger has a desired destination, and the job of the taxi is to go to the passenger, pick him/her up, go to the passenger's destination, and drop the passenger off. The taxi has six primitive actions available: move one step in one of the four directions (north, south, east and west), pick up the passenger, and put down the passenger. To make the task more difficult, the move actions are not deterministic, so a move has some chance of going in one of the other directions. Every move in the grid incurs a negative reward, attempting to pick up or drop the passenger at a wrong location incurs a penalty, and successfully finishing the task yields a positive reward. The grid is shown in Figure 1. There are 4 possible values for the destination, 5 for the passenger (the 4 locations plus being in the taxi) and 25 possible taxi locations, which gives $4 \times 5 \times 25 = 500$ states and, with six primitive actions, $500 \times 6 = 3000$ parameters in the Q-table that need to be learned. We denote the state variables as [dest pass x y] for later discussion.
The datasets of different sizes for each run were collected in advance by choosing actions uniformly at random. We evaluate each algorithm by repeatedly executing its greedy policy to obtain the average discounted return at regular increments of the sample size. We repeat the experiments five times to evaluate the influence of different sample distributions. The discount factor is fixed across all runs.
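The evaluation metric can be sketched as follows. This is a minimal illustration, assuming that each greedy execution is recorded as a list of per-step rewards; the function names and the example discount factor are hypothetical and are not the values used in the experiments.

```python
def discounted_return(rewards, gamma):
    """Discounted return r_0 + gamma * r_1 + gamma^2 * r_2 + ... of one episode."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def average_discounted_return(episodes, gamma):
    """Mean discounted return over several greedy executions."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


# Two hypothetical episodes, evaluated with an illustrative gamma of 0.5.
avg = average_discounted_return([[1.0, 1.0, 1.0], [2.0]], gamma=0.5)
```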
We conducted three sets of experiments: 1) a comparison of HQI with flat Q-value Iteration, including the effect of state abstraction, 2) learning policies for different DAGs from the same dataset and 3) learning policies using Fitted-HQI with Random Forests as the function approximator.
The first experiment compares HQI against flat Q-value Iteration (abbreviated FQI in the remainder of this section, not to be confused with Fitted Q Iteration). As pointed out by Dietterich, state abstraction is essential for MAXQ to learn faster than flat Q-learning. As a result, we manually conduct state abstraction for each subtask in DAG 1. However, unlike the aggressive state abstraction described by Dietterich, where every subtask and child pair has a different set of state variables, we only conduct a simple state abstraction at the subtask level, i.e. all children of a subtask share the same state abstraction. The final state abstraction is listed in Table 1. As described above, we perform five independent runs with different random samples of different sizes; we report the mean of the average discounted return over the five runs in Figure 3, and the best average discounted return of the five runs in Figure 4.
| Subtask | State variables |
|---|---|
| get | [pass x y] |
| put | [dest x y] |
| navi_get | [pass x y] |
| navi_put | [dest x y] |
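The subtask-level abstraction in Table 1 amounts to projecting the full state onto the variables kept for each subtask before indexing that subtask's Q-table. A minimal sketch follows; the dictionary-based state representation, the function name `abstract_state`, and the choice to keep all variables for subtasks not listed in the table (such as the root) are assumptions made for illustration.

```python
# Full set of state variables, in an assumed order.
STATE_VARIABLES = ("dest", "pass", "x", "y")

# Variables kept for each subtask, as listed in Table 1.
ABSTRACTION = {
    "get": ("pass", "x", "y"),
    "put": ("dest", "x", "y"),
    "navi_get": ("pass", "x", "y"),
    "navi_put": ("dest", "x", "y"),
}


def abstract_state(subtask, state):
    """Project a full state (a dict of variables) onto the variables kept for `subtask`."""
    # Assumption: subtasks without an entry in Table 1 keep every variable.
    keep = ABSTRACTION.get(subtask, STATE_VARIABLES)
    return tuple(state[name] for name in keep)


full_state = {"dest": 2, "pass": 0, "x": 3, "y": 1}
key = abstract_state("get", full_state)
```

Two full states that differ only in `dest` map to the same key for `get`, so every sample updates a smaller table.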
Results show that HQI, both with and without state abstraction, consistently outperforms FQI when training data is limited. When the dataset is large enough, they all converge to the same optimal performance. We also notice that, occasionally, HQI with state abstraction reaches the optimal performance with very limited samples. This demonstrates that with proper hierarchy constraints and a good behavior policy, HQI can generalize much faster than FQI. Moreover, even HQI without state abstraction consistently outperforms FQI in terms of sample efficiency. This differs from the behavior of the on-policy MAXQ-Q algorithm reported by Dietterich, which needs state abstraction in order to learn faster than Q-learning. We argue that HQI, unlike MAXQ-Q, is more sample efficient than FQI even without state abstraction for the following reasons: 1) HQI uses all applicable primitive samples to update the Q-table of every subtask, while MAXQ-Q only updates the subtask that executes that particular action. 2) An upper-level subtask in MAXQ-Q needs to wait for its children to gradually converge to their greedy optimal policies before it can obtain a good estimate of its own Q-values, while HQI does not have this limitation.
The second experiment runs HQI on different hierarchical decompositions of the original MDP. Figure 5 and Figure 6 show two other valid DAGs that also solve the original MDP. Figure 7 demonstrates that, with sufficient data, all three DAGs converge to their recursive optimal solutions, which confirms that HQI is able to converge for different hierarchies. In terms of sample efficiency, the three structures behave slightly differently. One of the DAGs learns noticeably more slowly than the other two, and we argue that this is due to a poor decomposition of the original MDP. In this problem, pick up and drop are both risky actions (illegal execution leads to a penalty); in the slow DAG these two actions are mixed with low-cost move actions, whereas the other two DAGs isolate them at a higher level of decision making. Therefore, designing a good hierarchy is crucial to obtain a performance gain over flat RL approaches. This emphasizes the importance of the off-policy nature of HQI, which allows developers to experiment with different DAG structures without collecting new samples. How to effectively evaluate the performance of a particular hierarchical decomposition without using a simulator is part of our future research.
The last experiment uses Random Forests as the function approximator to model the Q-value function in DAG 1. The main purpose is to demonstrate the convergence of Fitted-HQI. For each subtask, the Q-value function is modelled by a random forest with the state variables as the input features. Since the destination and passenger variables are categorical, the input is an 11-dimensional vector (4d for the destination, 5d for the passenger and 2d for the coordinates). We report the mean of the average discounted return over 5 independent runs with different random samples of different sizes. Figure 8 shows that Fitted-HQI achieves performance similar to tabular HQI.
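One way to build such an input vector is sketched below. The dimensions (4 for the destination, 5 for the passenger, 2 for the coordinates) come from the text; the one-hot encoding, the ordering of the features, the integer indexing of destination and passenger values, and the function name `encode_state` are assumptions.

```python
def encode_state(dest, passenger, x, y):
    """Encode a Taxi state as one-hot destination (4) + one-hot passenger (5) + (x, y)."""
    if not 0 <= dest < 4:
        raise ValueError("dest must be in 0..3")
    if not 0 <= passenger < 5:
        raise ValueError("passenger must be in 0..4 (4 locations, or in the taxi)")
    features = [0.0] * 11
    features[dest] = 1.0           # positions 0-3: destination
    features[4 + passenger] = 1.0  # positions 4-8: passenger location
    features[9] = float(x)         # position 9: taxi x coordinate
    features[10] = float(y)        # position 10: taxi y coordinate
    return features


vector = encode_state(dest=2, passenger=4, x=3, y=1)
```

The resulting vector can be passed to any regression model that accepts fixed-length numeric inputs.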
6.3 Comparison with MAXQ-Q and Intra-Option Learning
Compared to MAXQ-Q, HQI is more sample efficient and is off-policy. The advantage of being off-policy is that it does not require tuning exploration hyper-parameters such as the exploration rate. Since a high-level subtask in MAXQ-Q needs to wait for its children to converge first, developers usually set a faster exploration decay rate for lower-level subtasks, which is an extra hyper-parameter that needs tuning. The limitation of HQI is that it maintains an independent Q table for each subtask, while MAXQ-Q allows part of the parent's value function to be retrieved recursively from its children, a technique known as value function decomposition. This allows more compact memory usage and accelerates the learning of the parents. How to share value functions in the off-policy setting is a topic for future research.
Compared to intra-option learning in the option framework, the main advantage of HQI is that it does not require developers to fully define the policy of every option. Instead, one only needs to define a DAG with a termination predicate for each node in the graph. We argue that, in general, it is easier to define a task hierarchy than to give a full policy for every macro-action. Therefore, HQI combines the strength of intra-option off-policy learning with that of MAXQ. Prior work provides a method of training options in an off-policy fashion. Compared to it, HQI has the advantage of learning all subtasks from a flat batch dataset, so our algorithm requires neither a task DAG prior to collecting data nor a manual definition of option policies.
7 Conclusion and Future Work
In this paper, we introduced an off-policy batch learning algorithm for hierarchical RL. We showed that it is possible to collect data blindly using a random flat policy and then use this data to learn hierarchical structures that the data collection was not aware of. Our experiments on the Taxi domain show that the algorithm converges to the optimal policy faster than FQI. They also show that different DAG structures are able to learn from this flat data, at different speeds. Every DAG structure has its own number of parameters, which suggests a possible line of research on minimizing the number of parameters in the hierarchy. Other future work includes comparing different feature selection techniques for Fitted-SQI and applying the algorithm to large-scale and complex domains.
-  Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77, 2003.
-  Mitchell Keith Bloch. Reducing commitment to tasks with off-policy hierarchical reinforcement learning. arXiv preprint arXiv:1104.5059, 2011.
-  Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Section 22.4: Topological sort. Introduction to Algorithms (2nd ed.), MIT Press and McGraw-Hill, pages 549–552, 2001.
-  Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. J. Artif. Intell. Res.(JAIR), 13:227–303, 2000.
-  Thomas G Dietterich. An overview of maxq hierarchical reinforcement learning. In Abstraction, Reformulation, and Approximation, pages 26–44. Springer, 2000.
-  Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, pages 503–556, 2005.
-  Alborz Geramifard, Thomas J Walsh, Nicholas Roy, and Jonathan How. Batch-ifdd for representation expansion in large mdps. arXiv preprint arXiv:1309.6831, 2013.
-  Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107–1149, 2003.
-  Christopher Painter-Wakefield and Ronald Parr. Greedy algorithms for sparse reinforcement learning. arXiv preprint arXiv:1206.6485, 2012.
-  Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. Advances in neural information processing systems, pages 1043–1049, 1998.
-  Zhiwei Qin, Weichang Li, and Firdaus Janoos. Sparse reinforcement learning via convex optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 424–432, 2014.
-  Martin Riedmiller. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pages 317–328. Springer, 2005.
-  Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1):181–211, 1999.