
Autonomous Extracting a Hierarchical Structure of Tasks in Reinforcement Learning and Multi-task Reinforcement Learning

Reinforcement learning (RL), while often powerful, can suffer from slow learning speeds, particularly in high-dimensional spaces. The autonomous decomposition of tasks and the use of hierarchical methods hold the potential to significantly speed up learning in such domains. This paper proposes a novel, practical method that autonomously decomposes tasks by leveraging association rule mining, a data mining technique that discovers hidden relationships among entities. We introduce ARM-HSTRL (Association Rule Mining to extract Hierarchical Structure of Tasks in Reinforcement Learning), which extracts temporal and structural relationships among sub-goals in RL and multi-task RL. In particular, it finds sub-goals and the relationships among them. We demonstrate the efficiency and performance of the proposed method in both settings.




Reinforcement learning (RL) is a common approach to learning from delayed rewards via trial and error. RL can have trouble scaling to high-dimensional state spaces due to the curse of dimensionality (Barto and Mahadevan, 2003). Our central thesis is that many of these issues arise from a lack of knowledge about sub-goals and the hidden relationships among them that lead to goals. These task-dependent hidden relationships are a type of hierarchical knowledge: a representative structure can help define an ordering over states and sub-goals. Thus, using a hierarchical structure can considerably improve the abilities of RL.

Association rule mining (ARM) has been applied in bioinformatics (Bebek and Yang, 2007) and to market basket transactions in large stores (Lin et al., 2002). For example, in big markets with millions of transactions, ARM can automatically discover items that are commonly sold together (Tan et al., 2006). Knowing which products are bought together is useful because a seller can place them near each other in physical supermarkets or recommend them together in online settings. Common correlation methods are insufficient for this purpose: they are impractical at this scale, some of the relationships they extract are spurious, and others are uninteresting because they are already known (for more details see Tan et al. (2006)). ARM improves upon such simple methods by combining two main measures in a proven, efficient extraction strategy.

In the context of RL, sub-goals can be considered states that are correlated with successful policies for achieving goals; these states can be used to decompose the learning task (Digney, 1998; McGovern and Barto, 2002; Stolle, 2004). Sub-goals can help an agent combat the curse of dimensionality, accelerate the agent’s learning rate, and improve the quality of knowledge transfer (McGovern and Barto, 2002; Mousavi et al., 2014). However, extracting these states automatically is a challenging problem (e.g., see Chiu and Soo (2011)).

In transfer learning (TL) and multi-task reinforcement learning (MTRL), different types of transferred knowledge can be used, provided they make learning faster and are robust to task differences (Taylor and Stone, 2009). We believe TL and MTRL should be done autonomously, since RL is based on trial and error and is used in environments about which little is known. The proposed method autonomously extracts hierarchical structures and the similar parts of different tasks.

This paper’s main contribution is to introduce and validate a method that leverages ARM to extract sub-goals and their relationships in the form of task hierarchies and implication rules in RL and MTRL.

Background & Related Work

This section provides a brief overview of work relevant to the proposed method.

Reinforcement Learning

RL tasks are typically defined in the Markov Decision Process (MDP) framework as a 5-tuple $\langle S, A, P, R, \gamma \rangle$. In this paper, we focus on finite MDPs, where $S = \{s_1, \dots, s_n\}$ is a finite set of states, $A = \{a_1, \dots, a_m\}$ is a finite set of primitive actions, $P$ is a one-step probabilistic state transition function, $R$ is a reward function, and $\gamma$ is the discount rate. The agent’s goal is to find a policy (a mapping from states to actions), $\pi$, that maximizes the accumulated discounted reward, $\sum_{t} \gamma^{t} r_{t}$, for each state in $S$. A goal state is defined as an important, task-specific state that ends an episode once visited. A start state is a state from which an agent begins an episode.
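The accumulated discounted reward that the policy maximizes can be illustrated for a single episode. The following is a minimal sketch; the function name and the sample reward sequence are illustrative assumptions, not part of the original method.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: a reward of 1 arrives only at the third step.
example_return = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
print(example_return)
```

With a discount rate below one, the same reward is worth less the later it arrives, which is why shorter successful trajectories yield higher returns.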

In factored MDPs, states are described by a set of state variables. Dynamic Bayesian Networks (DBNs) can be a representative model of the transition model.

Hierarchical Reinforcement Learning

Temporally extended actions can take longer than one step and are composed of multiple primitive actions. The options framework (Sutton et al., 1999) uses a Semi-Markov Decision Process (SMDP) to define temporally extended actions. An option can be defined as a 3-tuple $\langle I, \pi, \beta \rangle$, where $I$ is the initiation set (i.e., all states from which the option can be started), $\pi$ is the option's policy (i.e., the probabilistic selection of a sequence of actions for each state in the option’s initiation set), and $\beta$ is the termination condition (i.e., the probability that the option terminates in a given state).
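As a minimal sketch of how such a 3-tuple might be represented, the following code stores the initiation set, policy, and termination condition and executes an option until it terminates. The class, the `run_option` helper, and the one-dimensional corridor used in the example are hypothetical and serve only as illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Option:
    initiation_set: FrozenSet[int]        # states where the option may start
    policy: Callable[[int], int]          # maps a state to a primitive action
    termination: Callable[[int], float]   # probability of terminating in a state

    def can_start(self, state):
        return state in self.initiation_set


def run_option(option, state, step, rng, max_steps=1000):
    """Follow the option's policy until its termination condition fires."""
    if not option.can_start(state):
        raise ValueError("option cannot be initiated in this state")
    steps = 0
    while steps < max_steps:
        state = step(state, option.policy(state))
        steps += 1
        if rng.random() < option.termination(state):
            break
    return state, steps


# Hypothetical corridor with states 0..5: the option walks right until state 5.
go_right = Option(
    initiation_set=frozenset(range(5)),
    policy=lambda s: +1,
    termination=lambda s: 1.0 if s == 5 else 0.0,
)
final_state, n_steps = run_option(go_right, 2, lambda s, a: s + a, random.Random(0))
print(final_state, n_steps)
```

From the agent's point of view, the whole call to `run_option` counts as a single temporally extended action, even though it is composed of several primitive steps.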

HRL is a general approach that divides learning into several subproblems, typically leveraging options. A learning problem is typically composed of several subproblems; thus, it is often much easier to learn each of them separately and then learn the main problem based on the solutions of those subproblems. Each option can correspond to the learned solution of a subproblem. Many methods, including Dietterich (2000), Jonsson and Barto (2006), and Mehta et al. (2011), have shown that when a correct hierarchy is provided by an RL expert, learning can be significantly improved.

Temporally extended actions in HRL are local policies for reaching local target states. Local targets are typically categorized into two groups: bottlenecks and sub-goals. Bottlenecks are states that provide easy access to neighboring regions, or that have a high gradient of the reward function, regardless of whether they lie on successful paths; they are found, for example, by methods based on the whole state transition graph (Mannor et al., 2004; Şimşek and Barto, 2004, 2009). Sub-goals are states that not only provide easy access or high reinforcement gradients but also must be visited often (McGovern and Barto, 2002; Stolle, 2004). For more details see Ghazanfari and Mozayani (2016) and Şimşek and Barto (2009). In fact, a hierarchy of a task is learned by using temporally extended actions during learning. In other words, in the options framework, temporally extended actions are often created to reach sub-goals, creating an implicit hierarchy.

If an expert has not already defined a hierarchy, several methods have been proposed to extract the task-dependent hierarchy from factored MDPs. HEX-Q (Hengst, 2003) extracts a hierarchy based on an ordering of how frequently state variables change. State variables with the highest change frequency are assigned to the lowest level of the hierarchy, and the state variable with the lowest number of changes is considered the root node. This method cannot find the relations between state variables, potentially causing learning to diverge (Mehta et al., 2008b).

VISA (Jonsson and Barto, 2006) analyzes the effects between state variables by building a causal graph with a DBN. State variables that affect others are assigned to deeper levels in the hierarchy. VISA considers the effects of all actions regardless of the domain and may create unnecessary branches or subtasks. Thus, it may produce an unreasonable, “exponentially sized hierarchy” (Mehta et al., 2008b), and performance on some problems may be poor. Such extraneous or incorrect branches may significantly reduce agent performance.

HI-MAT (Mehta et al., 2008b, 2011) leverages a single, carefully constructed trajectory. It uses the factored MDP to construct a MAXQ hierarchy, employing a DBN to build a causally annotated trajectory instead of the causal graphs of VISA. The authors claim that the constructed hierarchy is compact and comparable to manually engineered ones. The action model is given in advance in both VISA and HI-MAT, although both mention that methods such as that of Wynkoop and Dietterich (2008) can be used to build the models beforehand.

Vezhnevets et al. (2017) proposed a method for extracting sub-goals in deep RL. Bacon et al. (2017) used a different methodology, policy gradient, instead of extracting sub-goals to create temporally extended actions. Their method can handle only one task and needs to know the number of options in advance. It will face challenges in MTRL and when the number of subtasks increases. Such methods create temporally extended actions but do not decompose tasks.

Transfer Learning and Multi-task Reinforcement Learning

Several methods have been proposed for TL and MTRL; they can be categorized based on the task differences they can handle and the kind of knowledge they transfer (Taylor and Stone, 2009). Requiring a designer to supply the hierarchical structure limits TL and MTRL, since RL is mainly used in unknown environments. The only similar work that uses a task structure is Mehta et al. (2008a), but there a hand-made, high-level, semantic hierarchical task structure is given by the designer to show that the hierarchical structure of tasks can be used in TL to handle reward function differences. ARM-HSTRL decomposes tasks into a hierarchical structure autonomously. As a result, it can handle a wider range of task differences.

Association Rule Mining

The Association Rule Mining (ARM) problem is defined by $\langle \mathrm{ITEMSET}, \mathrm{Transaction} \rangle$, where ITEMSET $= \{i_1, \dots, i_n\}$ is the set of all items and Transaction $= \{t_1, \dots, t_m\}$ is the set of all transactions. Each transaction $t_j$ is a subset of the items of ITEMSET. For example, in Table 1, the set in the “Items” column of each row is a transaction: ITEMSET = {Bread, Movie, Beer, Coffee, Book}, Transaction = $\{t_1, \dots, t_7\}$, and $t_1$ = {Bread, Movie, Beer}.

Transaction ID Items
1 {Bread, Movie, Beer}
2 {Coffee, Book}
3 {Movie}
4 {Beer, Coffee, Movie, Book}
5 {Book, Coffee, Movie, Beer}
6 {Beer, Movie}
7 {Beer, Coffee, Movie, Book}
Table 1: An example of ITEMSET and Transaction

The relationship among states or items in the transaction set can be defined by an association rule. An association rule is expressed in the form $X \rightarrow Y$, where $X$ and $Y$ are disjoint itemsets, i.e., $X \cap Y = \emptyset$. The frequency with which $X$ and $Y$ occur together in a data set is a key factor, known as the support of the association rule. The frequency with which $X$ and $Y$ occur together, relative to the frequency of occurrence of $X$, is known as the confidence. Support and confidence are defined as follows (Tan et al., 2006):

$$\mathrm{Support}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N}, \qquad \mathrm{Confidence}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}$$

In the above equations, $\sigma(\cdot)$ is the number of transactions that contain the elements inside its parentheses, and $N$ is the total number of transactions. For example, in Table 1, Support(Beer $\rightarrow$ Movie) equals $5/7$ and Confidence(Book $\rightarrow$ Coffee) equals $4/4$. Support can be used as a measure to disregard items that occur together only a few times relative to the total number of transactions. Confidence expresses the reliability of the rule (i.e., it is the conditional probability of $Y$ given $X$). Two thresholds are needed, one for support and one for confidence; they can be variable or fixed. These two thresholds, used to disregard “trivial” rules, are known as minsup and minconf in the ARM literature (Tan et al., 2006). ARM algorithms typically consist of two parts:

  1. Frequent Itemset Generation: all of the itemsets that satisfy the minsup condition are extracted, i.e., frequent item sets.

  2. Rule Generation: all the rules derived from the frequent itemsets of the previous step that satisfy minconf are extracted. Building upon the output of Frequent Itemset Generation, this step calculates the confidence of each rule obtained from the frequent itemsets and checks its eligibility by comparing the confidence with the minconf threshold.
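The two measures can be checked directly against the transactions of Table 1. The sketch below is illustrative only; the helper names `sigma`, `support`, and `confidence` are assumptions chosen to mirror the definitions, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Transactions of Table 1.
TABLE_1 = [
    {"Bread", "Movie", "Beer"},
    {"Coffee", "Book"},
    {"Movie"},
    {"Beer", "Coffee", "Movie", "Book"},
    {"Book", "Coffee", "Movie", "Beer"},
    {"Beer", "Movie"},
    {"Beer", "Coffee", "Movie", "Book"},
]


def sigma(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(x, y, transactions):
    """Fraction of all transactions in which X and Y occur together."""
    return Fraction(sigma(set(x) | set(y), transactions), len(transactions))


def confidence(x, y, transactions):
    """Co-occurrences of X and Y relative to the occurrences of X."""
    return Fraction(sigma(set(x) | set(y), transactions), sigma(x, transactions))


print(support({"Beer"}, {"Movie"}, TABLE_1))      # Beer and Movie together
print(confidence({"Book"}, {"Coffee"}, TABLE_1))  # Coffee given Book
```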

As mentioned above, association rules have the form $X \rightarrow Y$, where $X$ and $Y$ are non-empty, disjoint subsets of a frequent itemset that satisfy the confidence condition. It is impractical to enumerate all possibilities naively. As mentioned elsewhere (Tan et al., 2006), $R = 3^d - 2^{d+1} + 1$, where $R$ is the number of possible rules and $d$ is the number of items. The FP-growth algorithm has been proposed for Frequent Itemset Generation; it constructs a compact data structure, the FP-tree, and relies on pruning. It has been shown to work for many practical problems (for the analysis of time complexity and more details about the FP-growth algorithm, see Kosters et al. (2003); Tan et al. (2006)). Thus, computational complexity does not limit the practicality of ARM-HSTRL (for more details see the “Theoretical analysis and relative advantages of ARM-HSTRL” subsection and Tan et al. (2006)). In Rule Generation, each frequent $k$-itemset yields $2^k - 2$ candidate rules, where $k$ is the number of items in the itemset (Tan et al., 2006). The confidence of each rule is calculated and compared against minconf. For example, if {Beer, Movie} is a frequent itemset, its rules are Beer $\rightarrow$ Movie and Movie $\rightarrow$ Beer. From Table 1, Confidence(Beer $\rightarrow$ Movie) equals $5/5$ and Confidence(Movie $\rightarrow$ Beer) equals $5/6$. If minconf is set to a value above $5/6$, the only association rule of this frequent itemset is Beer $\rightarrow$ Movie.
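The rule counts and the {Beer, Movie} example can be reproduced with a short brute-force sketch. The function names are hypothetical, and the minconf value of 9/10 is an assumption chosen only so that exactly one of the two rules survives; it is not taken from the paper.

```python
from fractions import Fraction
from itertools import combinations

# Transactions of Table 1.
TABLE_1 = [
    {"Bread", "Movie", "Beer"}, {"Coffee", "Book"}, {"Movie"},
    {"Beer", "Coffee", "Movie", "Book"}, {"Book", "Coffee", "Movie", "Beer"},
    {"Beer", "Movie"}, {"Beer", "Coffee", "Movie", "Book"},
]


def sigma(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)


def candidate_rules(itemset):
    """All rules X -> Y where X and Y are non-empty and partition `itemset`."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for premise in combinations(items, size):
            consequent = frozenset(items) - frozenset(premise)
            rules.append((frozenset(premise), consequent))
    return rules


def confident_rules(itemset, transactions, minconf):
    """Keep the candidate rules whose confidence reaches minconf."""
    kept = []
    for premise, consequent in candidate_rules(itemset):
        conf = Fraction(sigma(premise | consequent, transactions),
                        sigma(premise, transactions))
        if conf >= minconf:
            kept.append((premise, consequent, conf))
    return kept


def total_rule_count(d):
    """Brute-force count of rules X -> Y over d items (X, Y non-empty, disjoint)."""
    total = 0
    for size in range(1, d):
        for premise in combinations(range(d), size):
            total += 2 ** (d - size) - 1  # non-empty subsets of the remaining items
    return total


print(len(candidate_rules({"a", "b", "c"})))  # 2**3 - 2 candidate rules
print(total_rule_count(6), 3**6 - 2**7 + 1)   # brute force vs. closed form
print(confident_rules({"Beer", "Movie"}, TABLE_1, Fraction(9, 10)))
```

The brute-force count grows exponentially with the number of items, which is the reason the text relies on FP-growth and threshold-based pruning rather than naive enumeration.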


Key points can be extracted by observing different variations of operations and finding events that are frequently seen in successful trajectories. Similarly, ARM-HSTRL makes use of different trajectories.

ARM-HSTRL is composed of two parts (see Algorithm 1). First, ARM extracts association rules; second, HST-construction converts the association rules into a hierarchical structure tree. ARM itself is composed of two steps: first, frequent itemsets are generated; second, the rule generation procedure is applied to the output of the first step.

Input: Transition, a set of successful trajectories; minsup; minconf
Output: HST
Frequent Itemset = FP-growth(Transition, minsup)
Association Rules = Rule Generation(Frequent Itemset, minconf)
HST = HST-construction(Association Rules) // See Algorithm 2
Algorithm 1 ARM-HSTRL
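The first two steps of Algorithm 1 can be sketched end to end on toy data. This is not the authors' implementation: a brute-force enumeration stands in for FP-growth (it returns the same frequent itemsets but does not scale), rules are restricted to single-item consequents, and the state names and trajectories are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Brute-force stand-in for FP-growth: map each frequent itemset to its support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(1 for t in transactions if itemset <= t) / n
            if support >= minsup:
                frequent[itemset] = support
                found_any = True
        if not found_any:  # no frequent itemset of this size, so none larger
            break
    return frequent


def generate_rules(frequent, minconf):
    """Rules with a single-item consequent whose confidence reaches minconf."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            premise = itemset - {consequent}
            confidence = support / frequent[premise]
            if confidence >= minconf:
                rules.append((premise, consequent, confidence))
    return rules


def arm_rules(trajectories, minsup, minconf):
    """Treat each successful trajectory as a transaction of visited states."""
    transactions = [frozenset(t) for t in trajectories]
    return generate_rules(frequent_itemsets(transactions, minsup), minconf)


# Hypothetical successful trajectories with different start states.
trajectories = [
    ["s0", "key", "door", "goal"],
    ["s3", "key", "door", "goal"],
    ["s5", "key", "x", "door", "goal"],
]
rules = arm_rules(trajectories, minsup=1.0, minconf=0.9)
for premise, consequent, conf in rules:
    print(sorted(premise), "->", consequent, conf)
```

States that appear in only some trajectories, such as the start states, fall below minsup and never enter a rule; only the states shared by all successful trajectories remain as sub-goal candidates.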

In ARM-HSTRL, each trajectory of visited states is considered a transaction in Transition. All visited states in successful trajectories are in the ITEMSET. As mentioned, sub-goals are states that are frequently visited in successful trajectories (i.e., trajectories in which the agent reaches a goal state). In other words, the problem of finding sub-goals and the relations among them can be seen as extracting association rules whose items are sub-goal states. Sub-goals are often common to different tasks. Even for quite different tasks, some key subtasks are similar: when one goes to a university, a supermarket, or a restaurant, getting dressed, driving, and parking the car are common. In summary, ARM-HSTRL in HRL is able to analyze successful trajectories that are generated from different start states and goal states.

The FP-growth algorithm is used for Frequent Itemset Generation. If minsup is set to one, sub-goals must be visited in every trajectory of Transition. The maximum value of minsup is one. If minsup is too small, the performance of FP-growth decreases because it may pass false-positive itemsets to Rule Generation. Some RL domains have multiple types of successful trajectories, which may have different sub-goals; the value of minsup should be set to handle such conditions. However, if minsup is set to a value smaller than one, the extracted hierarchical structure will have more sub-goals, some of which may be unnecessary. Then, the Rule Generation procedure, as described in “Background & Related Work,” is called on the frequent itemsets output by FP-growth.

Recall that a confidence value is the conditional probability that the consequent of a rule occurs when its premise is seen; that is, the confidence of a rule $X \rightarrow Y$ is $\sigma(X \cup Y) / \sigma(X)$. Confidences are calculated and compared with minconf. In addition, the confidence value of each association rule can be used as a priority score for choosing among the temporally extended actions that correspond to the association rules.

Each extracted association rule is a set of sub-goals. For HST construction, the different possible sequences of these sub-goals must be extracted. In fact, the combination of HST and ARM is a sequential association rule mining procedure. The time, $t$, at which each sub-goal is visited in each trajectory can be compared to create the sequence in which sub-goals are seen. Each sequence shows the relationship among the sub-goals of one association rule in a flat manner. For example, suppose two sub-goals are visited in one order in some trajectories and in the opposite order in others. If the two orders occur equally often, the order of visiting these two sub-goals is not important for achieving the consequent. The order in each trajectory is like a local view, since different sequences to achieve goals can exist. The ordering positions of each sub-goal across all trajectories form a range; these numbers show the different possibilities for ordering subtasks. By ordering the ranges and making branches, HST construction makes a general plan from all of the possible paths.
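The ordering step can be sketched as follows, under the assumption that the time of a sub-goal is its first visit within a trajectory. The function names, the state labels, and the two trajectories are hypothetical.

```python
def subgoal_sequence(trajectory, subgoals):
    """Order the sub-goals of one trajectory by the time of their first visit."""
    first_visit = {}
    for t, state in enumerate(trajectory):
        if state in subgoals and state not in first_visit:
            first_visit[state] = t
    return [s for s, _ in sorted(first_visit.items(), key=lambda kv: kv[1])]


def position_ranges(trajectories, subgoals):
    """For each sub-goal, the (min, max) ordering position over all trajectories."""
    ranges = {}
    for trajectory in trajectories:
        for position, sg in enumerate(subgoal_sequence(trajectory, subgoals)):
            low, high = ranges.get(sg, (position, position))
            ranges[sg] = (min(low, position), max(high, position))
    return ranges


# Two hypothetical trajectories that visit sub-goals a and b in opposite orders.
trajectories = [
    ["s0", "a", "x", "b", "y", "c", "g"],
    ["s1", "b", "a", "z", "c", "g"],
]
subgoals = {"a", "b", "c"}
print([subgoal_sequence(t, subgoals) for t in trajectories])
print(position_ranges(trajectories, subgoals))
```

In this toy case, a and b share the same range of positions, indicating that their relative order is interchangeable, while c always comes after both.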

Algorithm 2, HST-construction, builds the hierarchical structure of tasks. Let $n$ be the number of association rules, and let $AR_i$ denote the sequence of sub-goals of the $i$-th rule. $|AR_i|$ is the number of items in $AR_i$: the premise of each rule has $|AR_i| - 1$ elements and its consequent has exactly one element. $AR_i[j]$ denotes the $j$-th element from the end of $AR_i$, so $AR_i[1]$ is the consequent and $AR_i[|AR_i|]$ is the first sub-goal of the premise.

Figure 1: An example of a HST-construction.

For example, consider three association rules (see Figure 1). First, the tree is constructed from the reverse of the first rule, creating a single branch. Then, the reverse of the second rule is added to the tree: it follows the existing branch until the first element that cannot be matched, where a new branch is created and the remaining elements of the rule are placed on it. Finally, the reverse of the third rule is added in the same way, and a new branch is created at the node where the mismatch occurs.

The HST helps an agent choose temporally extended actions correctly. There is another way to extract a hierarchical structure based on sub-goals: the extracted order is discarded, and the elements of association rules are treated as separate entities, as in methods that can only extract sub-goals and bottlenecks. Then, the hierarchical structure can be learned by adding the temporally extended actions that correspond to the extracted sub-goals in the learning phase of RL.

Input: AR-set, the set of association rules; AR-set = {AR_1, ..., AR_n}
Output: HST
n = |AR-set|
Construct a tree, HST, with one node that is the root node, Root
for i = 1 to n do
    Parent-Node = Root
    for j = 1 to |AR_i| do
        item = AR_i[j]   // the j-th element from the end of AR_i
        k = 1; matched = false
        repeat
            c = the number of children of the Parent-Node
            child = the k-th child of the Parent-Node (if any)
            if c > 0 and child = item then
                Parent-Node = child
                matched = true
            end if
            k = k + 1
        until matched or k > c
        if not matched then
            create a new child Node in the Parent-Node: Node = item
            Parent-Node = Node
        end if
    end for
end for
Algorithm 2 HST-construction
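The construction described above behaves like inserting reversed sequences into a prefix tree. The following is a minimal sketch under that reading, not the authors' code; the `Node` class, the helper names, and the three example sequences are hypothetical.

```python
class Node:
    def __init__(self, state):
        self.state = state
        self.children = []

    def child(self, state):
        """Return the child holding `state`, or None if there is no match."""
        for c in self.children:
            if c.state == state:
                return c
        return None


def build_hst(rule_sequences, root_state="ROOT"):
    """Insert each sub-goal sequence into the tree in reverse order."""
    root = Node(root_state)
    for sequence in rule_sequences:
        parent = root
        for state in reversed(sequence):  # start from the consequent
            node = parent.child(state)
            if node is None:              # mismatch: open a new branch here
                node = Node(state)
                parent.children.append(node)
            parent = node
    return root


def leaf_paths(node, prefix=()):
    """List the state sequences from just below `node` down to each leaf."""
    if not node.children:
        return [list(prefix)]
    paths = []
    for c in node.children:
        paths.extend(leaf_paths(c, prefix + (c.state,)))
    return paths


# Hypothetical rule sequences; the last element of each is the consequent.
sequences = [["a", "b", "c", "g"], ["d", "b", "c", "g"], ["e", "c", "g"]]
hst = build_hst(sequences)
print(leaf_paths(hst))
```

Here the second sequence shares g, c, and b with the first and branches only at its last element, while the third branches directly below c.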


This subsection follows the terms and definitions of Taylor and Stone (2009). In RL, ARM-HSTRL constructs the HST of tasks whose start and goal states differ in each run. However, ARM-HSTRL can be even more effective in MTRL. Partial policies (options) and structure can be used as transferred knowledge; in other words, it can handle other types of task differences. In ARM-HSTRL, source task selection is based on a library of learned tasks: they are checked and evaluated, and the most compatible one with the highest expected reward is selected. The allowed learners are hierarchical approaches (for more details see Taylor and Stone (2009)).

The learned structures and sub-goals help the agent make a rational and quick evaluation. Checking the structure has a low cost and is less prone to error; it can infer the correct MDP and help prioritize tasks. ARM-HSTRL can prioritize the possible tasks in three ways. 1) It can use the confidence values of ARM to approximate the likelihood of each task. 2) ARM-HSTRL acts like an attentional function: it extracts important states that provide task-dependent knowledge at low cost. 3) The structure of each task, in the form of sub-goals, is like a signature of that task. The HST is like a manual that shows the agent how to carry out complicated operations. It also separates tasks from each other by using the sequence of sub-goals of a task as the signature of that task.

Theoretical analysis and relative advantages of ARM-HSTRL

As mentioned in Tan et al. (2006), “the size of a FP-tree typically is smaller than the size of the uncompressed data,” and in the worst-case scenario, the size of an FP-tree is effectively equal to the size of the data. The performance of the FP-growth algorithm is related to the compaction factor of the trajectories and the value of minsup. In the worst case, the support values of all combinations of items exceed minsup, and up to $2^d - 1$ non-empty itemsets will be generated, where $d$ is the number of items. However, ARM-HSTRL looks for sub-goals, and the number of sub-goals in an RL task is much smaller than the size of the state space. Thus, the FP-growth algorithm is efficient and practical in ARM-HSTRL when the state space is large and the number of sub-goals is relatively low.

ARM is intended for real-world settings in which the state space is large and sparse. If the state space is small or the successful trajectories are very similar to each other, many states will be visited frequently, and ARM detects all of them as sub-goals. Clearly, the concept of sub-goals becomes meaningless in such conditions. Another possible scenario involves states adjacent to sub-goals, which might also be visited frequently. Under both conditions, one efficient solution is to cluster adjacent sub-goals into one entity and create one corresponding temporally extended action for that entity. The time $t$ of each state in each trajectory is saved and used in the HST for the possible orderings of sub-goals; these values can also be used to find close sub-goals and cluster them.

If a hierarchical policy is recursively optimal, the hierarchy is a hierarchically optimal policy by definition (Dietterich, 2000). In the following, a proof sketch is given for the convergence of ARM-HSTRL to a hierarchically optimal policy; it is similar to that of other methods that work in the options framework. As mentioned, subtasks are defined recursively in the form of temporally extended actions in the options framework. Learning in the options framework is proven (Sutton et al., 1999) to lead to optimal policies because an agent can always revert to primitive actions. Thus, ARM-HSTRL is recursively optimal, and the resulting policies are also hierarchically optimal policies.

Sub-goal extraction methods (McGovern and Barto, 2002; Stolle, 2004) can find sub-goals but not the hierarchical structure of tasks; they typically calculate their measures over the paths of the agent or over the shortest paths between nodes of a graph (Ghazanfari and Mozayani, 2016). Their performance typically degrades sharply as the state space becomes large and as the number of actions needed to reach goal states increases. ARM-HSTRL is based on the FP-growth algorithm, whose time complexity has proven practical for real-world use, and it considers all paths together in a single pass. The performance of ARM-HSTRL is independent of the number of actions, which means it can scale to continuous action spaces.

A theoretical comparison of the proposed method with the recent HI-MAT algorithm highlights several differences. DBNs must be obtained for each state variable before HI-MAT can be run, which may be time-consuming and difficult to perfect. HI-MAT also needs preprocessing to find, recursively, the DBN-closure for each state variable value and the related state variables of its causal links, as well as the DBN-closure for rewards (for more details see Mehta et al. (2008b)). In addition, unsuccessful and redundant action cycles must be removed. Since HI-MAT works on a single successful trajectory, it must be generalized by using another function (i.e., action generalization). HI-MAT cannot generalize from many different starting places to a few terminal states (i.e., it does not have the funnel property (Mehta et al., 2008b)). In many RL settings, there are several optimal or near-optimal trajectories, which will not be represented in HI-MAT.

In contrast, our proposed method does not require a single, carefully formed trajectory, and it can efficiently handle the funnel property of subtasks. The proposed method can work in both MDPs and factored MDPs, while the methods mentioned above (i.e., HEX-Q, VISA, and HI-MAT) can only work in factored MDPs. Our method does not need the action model in advance or a separate learning phase to obtain the data required for extracting the hierarchy. The proposed method can easily be extended to factored MDPs by considering the value of each state variable as an item.

Figure 2: The first task hierarchy of the first testbed, Figure 5, for experiment 1.
Figure 3: The second task hierarchy of the first testbed, Figure 5, for experiment 2.
Figure 4: The task hierarchy of the second testbed, Figure 6, for experiment 3.
Figure 5: The first testbed: the size of the maze is and it has 7 sub-goals. Subgoals are colored with yellow
Figure 6: The second testbed: the size of the maze is . Each task has 5 sub-goals and is shown with a different color.

Experimental Results

In this section, three experimental results are presented on two testbeds, Figure 5 and Figure 6, for ARM-HSTRL. The agent has 5 actions, and 4 movement primitive actions. The does not change the place of the agent. The agent can move with its primitive actions in four directions: , , , . If there is a wall in the way, the agent stays in its current state. In all of the experiments, if the agent does the in a wrong place, it receives the reward of in the sub-goal places and the reward of in other states. The reward for other actions is . The agent movement with probability is according to intended action and is randomly in one of the directions with probability . The discount factor was set to = .

In constructing the HST, start and goal places are chosen randomly. For each of them, the agent learns with a learning mechanism such as Q-learning; learning finishes after 5000 episodes. The resulting trajectories are ordered by accumulated reward, and the best five are selected. They are given to ARM-HSTRL, and HST-construction produces a hierarchical structure of tasks. Each node of the hierarchy is a sub-goal, and the corresponding temporally extended actions are calculated (McGovern and Barto, 2002). The temporally extended actions are then added to the agent's current primitive actions, and the HST helps the agent choose temporally extended actions along with primitive actions in the learning phase. Temporally extended actions are composed of primitive actions. If they are counted like primitive actions, the number of steps to reach a goal equals the number of action selection calls.
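The data-collection step can be sketched with a standard tabular Q-learning update and a helper that keeps the highest-reward trajectories. The learning rate, the sample transition, and the function names are illustrative assumptions; the paper states only that a mechanism like Q-learning is used and that the best five trajectories are kept.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """One tabular Q-learning step on the table `q`, keyed by (state, action)."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


def top_k_trajectories(runs, k=5):
    """Keep the k trajectories with the highest accumulated reward.

    `runs` is a list of (accumulated_reward, trajectory) pairs.
    """
    ranked = sorted(runs, key=lambda run: run[0], reverse=True)
    return [trajectory for _, trajectory in ranked[:k]]


# Hypothetical usage: one transition with reward 1, applied twice.
q = defaultdict(float)
actions = ["up", "down", "left", "right"]
first = q_learning_update(q, "s0", "up", 1.0, "s1", actions, alpha=0.5, gamma=0.9)
second = q_learning_update(q, "s0", "up", 1.0, "s1", actions, alpha=0.5, gamma=0.9)
print(first, second)

runs = [(1.0, ["s0", "g"]), (3.0, ["s2", "k", "g"]), (2.0, ["s1", "k", "g"])]
print(top_k_trajectories(runs, k=2))
```

The trajectories returned by `top_k_trajectories` would play the role of the successful trajectories passed to the mining step.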

HRL: ARM-HSTRL is evaluated in HRL on Figure 5 for two different hierarchical structure of tasks, experiment 1 and experiment 2. In them, for comparison between Q-learning and ARM-HSTRL, there are 10 runs as in each of them a start and a goal state is chosen randomly, the maximum number of actions for each episode is 4000, and the total number of episodes is 8000. In experiment 1, Figure 2, the task hierarchy has 7 levels – it has states. If the agent enters in sub-goals states in the following order and and do the action in each of them, and then enters in the goal state of the run and do in that too, the agent receives a reward of , and the episode will be finished. The value of minsup is 0.9 and the value of minconf is 0.9.

In experiment 2, Figure 3, the task hierarchy has 3 levels, but more complicated structure- it has states. If the agent enters in one of the sub-goals states from the leaves of tree or or , then enters in one of their parent or , then in and in order and does the action in each of them, and then enters in the goal state of the run and do , the agent receives a reward of and the episode will be finished. The value of minsup is 0.3 and the value of minconf is 0.9.

There is a significant difference in learning speed between the proposed method and Q-learning in HRL (Figures 8 and 10). The most important attribute of the SMDP framework is its use of temporally extended actions to decrease the number of steps. As shown for HRL in Figures 7 and 9, temporally extended actions considerably decrease the number of steps. p-values were calculated between the proposed method and Q-learning in each diagram using the t-test; the difference is significant, with p-values much smaller than 0.00001.

MTRL: In experiment 3, ARM-HSTRL is evaluated in MTRL on another maze (Figure 6) with a complex hierarchical structure (Figure 4), in which each task has a different transition function and reward function. The task hierarchy covers three different tasks, each with 5 sub-goals. ARM-HSTRL is used to extract the hierarchical structure of tasks (Figure 4) in this testbed. There are 2 runs, and the results are averaged. Each run has 1000 episodes; in each episode, a start state, a goal state, and a task are chosen randomly, and the maximum number of actions per episode is 10000. The agent must therefore learn to find out which task is currently active and then try to reach the goal state of that task. The value of minsup is 0.3 and the value of minconf is 0.6.

Actions and their effects are same with the HRL part. One of the tasks, green or purple or yellow, is activated in each run randomly, the agent should enter in sub-goals states of activated task in the following order and do the action in each of them, and then enters in the goal state of the run and do in that too, the agent receives a reward of , and the episode will be finished. In MTRL, we just show curves of ARM-HSTRL in Figure 11, and in Figure 12 since Q-learning cannot learn several tasks with different reward and transition functions.

Figure 7: Number of steps per episode for Q-learning and ARM-HSTRL in experiment 1 (Figure 2) on the first testbed (Figure 5).
Figure 8: Rewards received per episode for Q-learning and ARM-HSTRL in experiment 1 (Figure 2) on the first testbed (Figure 5).
Figure 9: Number of steps per episode for Q-learning and ARM-HSTRL in experiment 2 (Figure 3) on the first testbed (Figure 5).
Figure 10: Rewards received per episode for Q-learning and ARM-HSTRL in experiment 2 (Figure 3) on the first testbed (Figure 5).
Figure 11: Number of steps per episode for ARM-HSTRL in experiment 3 (Figure 4) on the second testbed (Figure 6).
Figure 12: Rewards received per episode for ARM-HSTRL in experiment 3 (Figure 4) on the second testbed (Figure 6).


ARM-HSTRL operates on trajectories of states. It extracts sub-goals, defined as states that are visited frequently in successful trajectories, together with the relationships among them. The relationships among the sub-goals of each task, that is, the decomposition of the tasks, are then organized into a hierarchical structure. Specifically, the two main parts of ARM, “Frequent Itemset Generation” and “Rule Generation,” are used to extract association rules. Each association rule shows a sequence of sub-goals and thus expresses the relation among them in a flat manner. In the second phase, HST-construction, the main hierarchical structure of tasks is built by combining the association rules.
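The two ARM parts named above can be sketched in a few lines. This is a generic, Apriori-style illustration and not the authors' implementation: it treats each successful trajectory as an unordered set of visited states, so it omits the temporal ordering and the HST-construction phase. The state labels and trajectories are hypothetical; the thresholds 0.3 and 0.6 are the minsup and minconf values reported for experiment 3.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, minsup, max_size=3):
    """Level-wise search for itemsets whose support is at least minsup."""
    items = sorted(set().union(*transactions))
    frequent = {}
    current = [frozenset([i]) for i in items]
    size = 1
    while current and size <= max_size:
        level = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= minsup:
                level[candidate] = s
        frequent.update(level)
        # Join frequent itemsets of this level into candidates one item larger.
        current = list({a | b for a, b in combinations(level, 2)
                        if len(a | b) == size + 1})
        size += 1
    return frequent


def generate_rules(frequent, minconf):
    """Return (antecedent, consequent, confidence) triples meeting minconf."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                # Every subset of a frequent itemset is itself frequent.
                conf = sup / frequent[antecedent]
                if conf >= minconf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules


# Hypothetical successful trajectories, reduced to sets of visited states.
trajectories = [
    frozenset({"s1", "s4", "s7", "s9"}),
    frozenset({"s1", "s4", "s8", "s9"}),
    frozenset({"s2", "s4", "s7", "s9"}),
    frozenset({"s1", "s5", "s7"}),
    frozenset({"s1", "s4", "s9"}),
]

frequent = frequent_itemsets(trajectories, minsup=0.3)
rules = generate_rules(frequent, minconf=0.6)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), f"conf={conf:.2f}")
```

In this sketch, states whose singleton itemsets pass minsup play the role of candidate sub-goals, and the rules that pass minconf are the flat relations that a later construction step would combine.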

Unlike HI-MAT, the most recent work on task decomposition in HRL, ARM-HSTRL does not need an action model, is not limited to factored MDPs, can learn from several different trajectories, and does not require the paths to be cleaned or preprocessed. As discussed and shown theoretically earlier, ARM-HSTRL is efficient and practical, and it leads to hierarchically optimal policies. In addition, the experimental results show empirically a considerable improvement in learning in the two HRL experiments.

The main contribution of the paper to MTRL and TL is that, when the hierarchical structure of tasks is extracted autonomously, the method can distinguish and capture more differences among tasks and handle them when creating partial policies. For example, in experiment 3, each task has a different transition function and reward function. The agent can also distinguish the tasks from one another, and separate or combine the learning of each task depending on the similarities and differences among their structures. In other words, ARM-HSTRL can handle MTRL and transfer its learning among tasks with different transition functions. By contrast, the lack of structural knowledge makes Q-learning impractical in this setting.

The only work that used a task structure in TL (Mehta et al., 2008a) relied on a hand-made task structure to handle differences in reward functions. ARM-HSTRL extracts the hierarchical structure of tasks autonomously, handles many more kinds of task differences, such as differences in transition functions, and uses the structure of tasks to combine and separate the learning of each task. To the best of our knowledge, no other method has been proposed that provides these abilities in MTRL and TL. We believe that the extracted hierarchical structure, in the form of sub-tasks, is a complementary, more robust, and higher-level form of knowledge for transfer than shared value functions. The effectiveness of transferring value functions is much more sensitive to the kind and amount of similarity between the source and target domains. A decomposed task structure provides abstraction, so the agent can reuse, generalize, and transfer the knowledge to new domains.


  • Bacon et al. (2017) Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In AAAI, pages 1726–1734, 2017.
  • Barto and Mahadevan (2003) Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
  • Bebek and Yang (2007) Gurkan Bebek and Jiong Yang. Pathfinder: mining signal transduction pathway segments from protein-protein interaction networks. BMC bioinformatics, 8(1):335, 2007.
  • Chiu and Soo (2011) Chung-Cheng Chiu and Von-Wun Soo. Subgoal identifications in reinforcement learning: A survey. INTECH Open Access Publisher, 2011.
  • Dietterich (2000) Thomas G Dietterich. Hierarchical reinforcement learning with the maxq value function decomposition. J. Artif. Intell. Res.(JAIR), 13:227–303, 2000.
  • Digney (1998) Bruce L Digney. Learning hierarchical control structures for multiple tasks and changing environments. In Proceedings of the fifth international conference on simulation of adaptive behavior on From animals to animats, volume 5, pages 321–330, 1998.
  • Ghazanfari and Mozayani (2016) Behzad Ghazanfari and Nasser Mozayani. Extracting bottlenecks for reinforcement learning agent by holonic concept clustering and attentional functions. Expert Systems with Applications, 54:61–77, 2016.
  • Hengst (2003) Bernhard Hengst. Discovering hierarchy in reinforcement learning. University of New South Wales, 2003.
  • Jonsson and Barto (2006) Anders Jonsson and Andrew Barto. Causal graph based decomposition of factored mdps. Journal of Machine Learning Research, 7(Nov):2259–2301, 2006.
  • Kosters et al. (2003) Walter A Kosters, Wim Pijls, and Viara Popova. Complexity analysis of depth first and fp-growth implementations of apriori. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pages 284–292. Springer, 2003.
  • Lin et al. (2002) Weiyang Lin, Sergio A Alvarez, and Carolina Ruiz. Efficient adaptive-support association rule mining for recommender systems. Data mining and knowledge discovery, 6(1):83–105, 2002.
  • Mannor et al. (2004) Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the twenty-first international conference on Machine learning, page 71. ACM, 2004.
  • McGovern and Barto (2002) Amy McGovern and Andrew G Barto. Autonomous discovery of temporal abstractions from interaction with an environment. PhD thesis, University of Massachusetts, 2002.
  • Mehta et al. (2008a) Neville Mehta, Sriraam Natarajan, Prasad Tadepalli, and Alan Fern. Transfer in variable-reward hierarchical reinforcement learning. Machine Learning, 73(3):289–312, 2008a.
  • Mehta et al. (2008b) Neville Mehta, Soumya Ray, Prasad Tadepalli, and Thomas Dietterich. Automatic discovery and transfer of maxq hierarchies. In Proceedings of the 25th international conference on Machine learning, pages 648–655. ACM, 2008b.
  • Mehta et al. (2011) Neville Mehta, Soumya Ray, Prasad Tadepalli, and Thomas Dietterich. Automatic discovery and transfer of task hierarchies in reinforcement learning. AI Magazine, 32(1):35–50, 2011.
  • Mousavi et al. (2014) Seyed Sajad Mousavi, Behzad Ghazanfari, Nasser Mozayani, and Mohammad Reza Jahed-Motlagh. Automatic abstraction controller in reinforcement learning agent via automata. Applied Soft Computing, 25:118–128, 2014.
  • Şimşek and Barto (2004) Özgür Şimşek and Andrew G Barto. Using relative novelty to identify useful temporal abstractions in reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 95. ACM, 2004.
  • Şimşek and Barto (2009) Özgür Şimşek and Andrew G Barto. Skill characterization based on betweenness. In Advances in neural information processing systems, pages 1497–1504, 2009.
  • Stolle (2004) Martin Stolle. Automated discovery of options in reinforcement learning. PhD thesis, McGill University, 2004.
  • Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
  • Tan et al. (2006) P.N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Always learning. Pearson Addison Wesley, 2006. ISBN 9780321321367.
  • Taylor and Stone (2009) Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009.
  • Vezhnevets et al. (2017) Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
  • Wynkoop and Dietterich (2008) Michael Wynkoop and Thomas Dietterich. Learning mdp action models via discrete mixture trees. Machine Learning and Knowledge Discovery in Databases, pages 597–612, 2008.