Hierarchical reinforcement learning (HRL) is a promising approach for solving problems with a large state space and a lack of immediate reinforcement signal (Dayan & Hinton, 1993; Wiering & Schmidhuber, 1997). The hierarchical approach decomposes a complex task into a set of sub-tasks using hierarchical structures, a natural procedure also performed by humans (Rasmussen et al., 2017). However, one aspect of human problem-solving remains poorly understood: the ability to find an appropriate hierarchical structure. Finding good decompositions is usually an art-form, and automatically identifying the required decomposition is a major challenge. Despite a number of achievements in this direction (Hengst, 2012), discovering hierarchical structure is still an open problem in reinforcement learning.
Most of the efforts aimed at learning in hierarchies concern accelerating Q-learning by identifying bottlenecks in the state space (Menache et al., 2002). The most popular framework in these works is Options (Precup et al., 1998), within which artificial agents construct and extend hierarchies of reusable skills or meta-actions (options). A suitable set of skills can help improve an agent's efficiency in learning to solve difficult problems. Another commonly used approach in HRL is the MAXQ framework (Dietterich, 2000), in which the value function is decomposed over the task hierarchy. Automated discovery of option hierarchies (Mannor et al., 2004) and task decomposition within the MAXQ approach (Mehta et al., 2008) showed good results in a number of synthetic problems, e.g. the Rooms or Taxi environments. Most real problems in robotics are very different from these artificial examples. The tasks of manipulator control and robot movement in space are of great practical interest (Tamar et al., 2016; Gupta et al., 2017). Despite existing attempts to adapt these approaches to continuous spaces (Daniel et al., 2016), they are of little use in such tasks, for at least two reasons. The first is the lack of mixed action and state abstraction (Konidaris, 2016). The second is that pseudo-rewards must be specified to learn hierarchically optimal policies.
In this paper, we focus on automatically discovering sub-tasks and hierarchies of meta-actions within an on-model variant of the HAM framework (Parr & Russell, 1998). One motivation for using abstract machines is that the HAM approach is able to design good controllers that realize specific behaviours. This is especially important when developing control systems for robotic systems. Among other things, HAMs are a way to partially specify procedural knowledge that transforms a Markov decision process (MDP) into a reduced semi-Markov decision process (SMDP). We propose a new approach to learning the structure of an abstract machine by introducing an "internal" environment whose states represent the structure of HAMs. We find the structure of machines for particular classes of external environment states and then combine the constructed machines into a superior machine. Automated discovery of such structures is compelling for at least two reasons. First, it avoids the significant human effort of engineering the task-subtask structural decomposition. Second, it enables significant transfer of learned structural knowledge from one domain to another.
2.1 Semi Markov Decision Problems
The set of actions in hierarchical reinforcement learning consists of primitive actions and temporally extended or abstract actions (skills or meta-actions). Because of this, we need to extend the notion of Markov decision processes (MDPs), which define the environment in classical reinforcement learning. MDPs that include abstract actions are called semi-Markov decision problems, or SMDPs (Puterman, 1994). In a reinforcement learning task the agent's goal is to find the optimal strategy. An agent using Q-learning in an MDP environment achieves the goal by performing value updates after calling the action $a$ in the state $s$, going into state $s'$ and receiving a reward $r$:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],$$
where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
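The tabular update above can be sketched in a few lines of Python (the learning rate, discount factor, and state/action names here are illustrative, not from the paper):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)  # maps (state, action) -> value estimate

def q_update(s, a, r, s_next, actions):
    """One step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

q_update("s0", "right", 1.0, "s1", ["right", "down"])
```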
Let $\tau$ be the number of steps needed to complete the abstract action $a$ starting from the state $s$ and terminating in state $s'$. The transition function $P(s', \tau \mid s, a)$ gives the probability of this transition. Then the Q-learning update for the SMDP is written as follows:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R + \gamma^{\tau} \max_{a'} Q(s', a') - Q(s, a) \right],$$
where $\tau$ is the number of steps performed after calling the action $a$ in the state $s$ before the state $s'$ was reached, and $R$ is the cumulative discounted reward received during this time.
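The SMDP variant differs from the one-step update only in the discounting: a minimal sketch, assuming the abstract action's per-step rewards are available as a list (function and variable names are illustrative):

```python
def smdp_q_update(Q, s, a, rewards, s_next, actions, alpha=0.1, gamma=0.99):
    """SMDP Q-learning: bootstrap is discounted by gamma**tau, where tau is
    the duration of the abstract action."""
    tau = len(rewards)
    R = sum(gamma**i * r for i, r in enumerate(rewards))  # cumulative discounted reward
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    q = Q.get((s, a), 0.0)
    Q[(s, a)] = q + alpha * (R + gamma**tau * best_next - q)
    return Q[(s, a)]
```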
Introducing abstract actions is an important step that accelerates the learning process, although the agent continues to use primitive actions as well. Abstract actions and SMDPs naturally lead to a hierarchical structure of the action set, and abstract actions can themselves be policies of smaller SMDPs. It should be noted that HRL cannot in general guarantee that the optimal solution of the full problem will be found.
2.2 Reinforcement Learning with HAMs
The HAM approach limits the possible actions of the agent by transitions between states of the machine (Parr & Russell, 1998). An example of a simple abstract machine is: "constantly move to the right or down." Transitions to certain machine states cause execution of actions in the environment, while the remaining transitions are of a technical nature and define the internal logic.
An abstract machine is a five-tuple $\langle Q, \Sigma, \Lambda, \delta, \lambda \rangle$, where $Q$ is a finite set of machine states, $\Sigma$ is an input alphabet corresponding to the state space of the environment, $\Lambda$ is the output alphabet of the abstract machine, $\delta$ is the function of transition to the next machine state given the current machine state $q \in Q$ and the environment state $x \in \Sigma$, and $\lambda$ is the output function of the machine.
The machine's states are divided into several types: Start, the state in which any machine begins its operation; Action, a state whose entry causes a particular action to be performed in the environment; Choice, a state from which, if there are several outgoing transitions, the next one is chosen stochastically; Call, a state whose entry suspends the execution of the current machine and calls the machine specified in this state; and Stop, a state whose entry stops the execution of the current machine.
Since the choice of the next transition is non-deterministic only in a Choice state, the value update is performed between the previous and the current Choice states, taking into account the current state of the environment:
$$Q([s_c, m_c], a) \leftarrow Q([s_c, m_c], a) + \alpha \left[ r_c + \gamma_c \max_{a'} Q([s'_c, m'_c], a') - Q([s_c, m_c], a) \right],$$
where $[s_c, m_c]$ is the pair of environment and machine states at a choice point, and $r_c$ and $\gamma_c$ accumulate the reward and the discount since the previous choice point.
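A sketch of this choice-point update, in the spirit of HAMQ-learning (Parr & Russell, 1998); the class layout and method names are our own illustration, not the paper's implementation:

```python
class HAMQ:
    """Q-values are kept for (environment state, machine state) pairs at Choice
    points; the update is applied between successive Choice points."""

    def __init__(self, alpha=0.1, gamma=0.99):
        self.Q = {}
        self.alpha, self.gamma = alpha, gamma
        self.prev = None                    # (choice point, chosen transition) to update
        self.r_acc, self.disc = 0.0, 1.0    # reward/discount accumulated since last Choice

    def on_reward(self, r):
        """Called after every primitive action between Choice points."""
        self.r_acc += self.disc * r
        self.disc *= self.gamma

    def at_choice(self, env_state, machine_state, transitions):
        """Called on entering a Choice state: update the previous choice, pick the next."""
        cp = (env_state, machine_state)
        best = max(self.Q.get((cp, t), 0.0) for t in transitions)
        if self.prev is not None:
            q = self.Q.get(self.prev, 0.0)
            self.Q[self.prev] = q + self.alpha * (self.r_acc + self.disc * best - q)
        self.r_acc, self.disc = 0.0, 1.0
        choice = max(transitions, key=lambda t: self.Q.get((cp, t), 0.0))  # greedy; exploration omitted
        self.prev = (cp, choice)
        return choice
```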
3 Hierarchy formation
We solve the problem in an environment where additional parameters are added to the standard information about states. There may be several such additional features; we call all possible combinations of features clusters. E.g., for the well-known blocks domain, a feature can be the information about which part of the world the agent is located in.
Initially, the agent is trained on a training set of tasks. The training set consists of a number of tasks from which the algorithm needs to generalize the received information and then apply it to solve similar or even more complex tasks in the same environment.
At the first stage of the algorithm, an abstract machine is built for each cluster; this machine will always be called when the environment reports that the agent is currently in that cluster. The construction proceeds by enumerating possible abstract machines, using a number of heuristics.
Generation and pruning of abstract machines occur as follows:
A list of possible vertices of which a machine can consist is determined. Each machine has a Start and a Stop vertex, and there can be Action vertices for the various types of actions, as well as Choice vertices.
Parameters are specified: the maximum number of vertices in the machine and the number of vertices of each type.
All possible ordered permutations of vertices are generated.
Edges are then added to the machine, taking into account the limitations of the HAM structure: the Start state must have exactly one outgoing edge and no incoming edges; the Stop state must have exactly one incoming edge and no outgoing edges; an Action state can have only one outgoing edge; no vertex can have a self-loop; there can be no edge from a Choice state to the Stop state; and a Choice state must have at least two outgoing edges.
For each machine it is checked that all vertices lie in a single component, that all vertices are reachable from the Start vertex, and that there are no vertices from which the Stop state cannot be reached. Additionally, the absence of Choice cycles is checked.
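The structural constraints above can be sketched as a validity check over a candidate machine graph (the representation is an assumption of ours; the Choice-cycle check is omitted for brevity):

```python
# `kinds` maps a vertex name to one of {"Start", "Action", "Choice", "Stop"};
# `edges` maps a vertex to its successors; exactly one Start and one Stop assumed.

def _reach(src, out):
    """Set of vertices reachable from `src` in the graph `out`."""
    seen, stack = set(), [src]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(out.get(v, []))
    return seen

def valid_ham(kinds, edges):
    out = {v: list(edges.get(v, [])) for v in kinds}
    incoming = {v: [] for v in kinds}
    for v, succs in out.items():
        for s in succs:
            if s == v:                                    # no self-loops
                return False
            if kinds[v] == "Choice" and kinds[s] == "Stop":
                return False                              # no Choice -> Stop edges
            incoming[s].append(v)
    for v, kind in kinds.items():
        if kind == "Start" and (len(out[v]) != 1 or incoming[v]):
            return False
        if kind == "Stop" and (out[v] or len(incoming[v]) != 1):
            return False
        if kind == "Action" and len(out[v]) != 1:
            return False
        if kind == "Choice" and len(out[v]) < 2:
            return False
    start = next(v for v in kinds if kinds[v] == "Start")
    stop = next(v for v in kinds if kinds[v] == "Stop")
    if _reach(start, out) != set(kinds):                  # all reachable from Start
        return False
    rev = {v: [] for v in kinds}
    for v, succs in out.items():
        for s in succs:
            rev[s].append(v)
    return _reach(stop, rev) == set(kinds)                # every vertex can reach Stop
```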
When such a list of machines has been built, the pruning stage takes place. For this stage, it is necessary to train the algorithm using, for each cluster, a standard machine consisting of a single Choice state with edges to all possible Action states for the given environment; this machine corresponds to choosing any action in each environment state.
A machine is checked as follows: a cluster is chosen for which the machine was built, together with the machine to be checked and the trained standard machines. The learning algorithm uses the selected machine for the chosen cluster and the trained standard machine for the remaining clusters. If the process converges after a small number of iterations, we record that the machine is applicable to the current cluster.
When a list of applicable machines has been built for each cluster, we apply the internal environment algorithm to each set of cluster machines. Based on the structure of the graphs, an internal environment is formed in which the agent's actions are the selection of graph vertices (in the first step) and the addition of edges in subsequent steps. The environment is organized so that each action chosen by the agent leads to a state for which there is a verified machine.
An iterative process is started in which, for each cluster, an attempt is made to build a machine better than the current one using the trained internal environment. Initially, the standard machine is used for each of the clusters. If the machine constructed by the internal environment leads to a better result, it is added to the solution; otherwise the algorithm proceeds to the next step.
We consider the process of changing the HAM structure as a sequence of special internal or "mental" actions of the agent. This means that the agent acts in a second, internal environment, not just in the external environment of surrounding objects (see fig. 1).
We take the state of the internal environment to be the structure of a graph together with some additional information, e.g. the graph's statistics. Acting on a variety of tasks in the external environment, the agent can learn to build suitable hierarchies for the whole set of tasks in a cluster. The agent will also try to produce hierarchies for new, previously unseen tasks.
Consider a set of tasks, in one or several external environments, belonging to a single cluster, for which the agent will automatically build a hierarchy of HAMs. Each task corresponds to some SMDP for the external environment with a set of states $S$ and a set of actions $A$.
At each agent step, the external environment returns the reward $r_t$ received at the current step, the current state of the external environment $s_t$, and a flag indicating the end of the current task. We define an internal environment consisting of:
the set of states $S_I$, where each state corresponds to the structure of the graph that defines the HAM hierarchy;
the set of actions $A_I$, where each action corresponds to some change in the structure of the graph.
The internal environment is episodic, and at the beginning of each episode the agent receives information about the external environment for which it will be trained to build a HAM hierarchy. Let $T$ be the number of steps in the internal environment. Each action performed by the agent in the internal environment changes the graph structure.
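A minimal sketch of such an internal environment, with a gym-style reset/step interface; the graph representation, action encoding, and class name are our own assumptions:

```python
class InternalEnv:
    """States are HAM graph structures; actions edit the graph; the episode
    reward would be the external reward obtained with the resulting hierarchy."""

    def __init__(self, external_task, max_steps=50):
        self.task = external_task       # external environment the hierarchy is built for
        self.max_steps = max_steps

    def reset(self):
        self.graph = {"vertices": {"S": "Start", "F": "Stop"}, "edges": {}}
        self.t = 0
        return self.graph

    def step(self, action):
        """action: a graph edit, e.g. ('add_vertex', name, kind) or ('add_edge', u, v)."""
        if action[0] == "add_vertex":
            _, name, kind = action
            self.graph["vertices"][name] = kind
        elif action[0] == "add_edge":
            _, u, v = action
            self.graph["edges"].setdefault(u, []).append(v)
        self.t += 1
        done = self.t >= self.max_steps
        reward = self._evaluate() if done else 0.0
        return self.graph, reward, done

    def _evaluate(self):
        # placeholder: would train on self.task with the current hierarchy
        # and return the (averaged) total external reward
        return 0.0
```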
Consider the process of Q-learning in the internal environment:
$f$ – a function that maps the statistical indicators of training with the current hierarchy to a number; its value is calculated after training on the external environment. Such an indicator can be a binary value, which is true if the agent collects the necessary total reward in the external environment. In this case, if the value is true, the algorithm can decide on a possible transition to the same state of the hierarchy; otherwise it is advantageous to continue the search for the hierarchy;
$h_{t-1}$ – the state of the HAM hierarchy in the previous step;
$f(h_{t-1})$ – the value of the function in the previous step;
$r_{t-1}$ – the reward received by the agent in the previous step, corresponding to the total reward for the external environment within the task. Since the learning process is rather noisy, we perform several trials and take the average;
$a_{t-1}$ – the action selected in the previous step;
$h_t$ – the state of the internal environment in the current step.
Then the Q-learning update is written as usual:
$$Q(h_{t-1}, a_{t-1}) \leftarrow Q(h_{t-1}, a_{t-1}) + \alpha \left[ r_{t-1} + \gamma \max_{a} Q(h_t, a) - Q(h_{t-1}, a_{t-1}) \right].$$
Listing 1 shows the pseudo-code of the algorithm for acting in the internal environment and illustrates the idea of transferring the total reward received in the external environment into an internal reward indicating the quality of the performed "mental" action.
The transition function of the internal environment training algorithm receives an action as input and changes the current state of the HAM accordingly. Based on the resulting machine state, the new state of the internal environment is calculated. It is then checked whether the new machine can be started, using a reachability check for the Stop state and a check for cycles. If the checks pass, the machine is run on the external environment and the total reward is retained; otherwise the machine is assigned a large negative reward. From the state of the machine and the received reward, the value function is updated.
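This transition step can be sketched as follows; the helper callables, the penalty value, and the number of trials are hypothetical placeholders, not values from the paper:

```python
PENALTY = -1000.0  # assumed "large negative reward" for an unrunnable machine

def internal_step(graph, action, apply_edit, is_runnable, run_external, n_trials=5):
    """Apply the graph edit; if the machine is runnable (Stop reachable, no
    Choice cycles), run it on the external environment, else penalize."""
    new_graph = apply_edit(graph, action)
    if not is_runnable(new_graph):
        return new_graph, PENALTY
    # learning is noisy, so average the total external reward over several trials
    total = sum(run_external(new_graph) for _ in range(n_trials)) / n_trials
    return new_graph, total
```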
4 Experimental evaluation
We consider a robotics-inspired environment in which a manipulator with a magnet performs actions on metal cubes. The goal is to build a tower of a given height. Five actions are available to the agent: it can move the manipulator to an unoccupied cell on any of the four adjacent sides, and it can toggle the magnet (see fig. 2). If the magnet is turned off, the cube instantly falls down onto an unoccupied cell. A cube held by the manipulator can be moved horizontally (left or right) only from the uppermost position. If the agent tries to apply such an action when not in the upper position, the position of the manipulator does not change and the cube immediately drops down.
The environment is episodic and ends after a certain number of actions. An episode finishes early if a tower of the required height is built. The agent receives a terminal reward for constructing the tower of the desired height and a step reward for any other action.
The division into clusters in the environment is based on the following features: the height at which the manipulator is located and whether the magnet holds a cube.
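Under the assumption that exactly these two features define the clusters, the cluster key is just their combination (the function name is ours):

```python
def cluster(manipulator_height, holding_cube):
    """Cluster id for the blocks environment: every combination of the two
    features (manipulator height, magnet-holds-cube flag) is a cluster."""
    return (manipulator_height, bool(holding_cube))
```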
For the training set, we used two environments:
height: 4, width: 3, number of cubes: 3, episode length: 200, tower target size: 3.
height: 5, width: 4, number of cubes: 4, episode length: 500, tower target size: 4.
The test set consisted of one environment: height: 6, width: 4, number of cubes: 5, episode length: 800, tower target size: 5.
To determine the overall reward over several environments, we used normalization with respect to the maximum reward received in the given environment with the selected cluster. For exploration we used $\epsilon$-greedy with initial value $\epsilon = 0.1$. The discount factor $\gamma$ was set to 0.99, and the learning rate $\alpha$ to 0.1.
During the preliminary experiments, the above approach was not demonstrated in full. For the stage of combining machines we used a set of the best machines built during training, with a simplified algorithm for the consolidation stage. An iterative process improves the combined solution: at each step, the remaining cluster with the best total reward is taken and two cases are checked: in the first, the machine of the cluster under consideration is added to the combined solution; in the second, it is not.
Machines were not built for clusters that were not represented in each of the environments. The results of the experiment are shown in the diagram (see fig. 3). It shows that this approach significantly increased the rate of convergence in comparison with the standard Q-learning algorithm. The algorithm built machines (see Appendix) that build the tower on the left and move to the right of the cubes. These meta-actions turned out to be profitable and significantly increased the learning speed of the algorithm.
In this paper we proposed a new approach to hierarchy formation within the HAM framework. We chose the HAM abstraction because the HAM approach is able to design good controllers that realize specific behaviours. To do this, we introduced a so-called internal or "mental" environment in which a state encodes the structure of the HAM hierarchy, and an internal action in this environment changes that hierarchy. We suggest classical Q-learning in the internal environment, which allows us to obtain an optimal hierarchy. We extend the HAM framework by adding an on-model approach to select the appropriate sub-machine to execute action sequences for a certain class of external environment states. Preliminary experiments demonstrated the prospects of the method.
This work was supported by the Russian Science Foundation (Project No. 16-11-00048).
- Daniel et al. (2016) Daniel, Christian, van Hoof, Herke, Peters, Jan, and Neumann, Gerhard. Probabilistic inference for determining options in reinforcement learning. Machine Learning, 104(2-3):337–357, 2016.
- Dayan & Hinton (1993) Dayan, Peter and Hinton, Geoffrey. Feudal Reinforcement Learning. Advances in neural information processing systems, pp. 271–278, 1993.
- Dietterich (2000) Dietterich, Thomas G. Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
- Gupta et al. (2017) Gupta, Saurabh, Davidson, James, Levine, Sergey, Sukthankar, Rahul, and Malik, Jitendra. Cognitive Mapping and Planning for Visual Navigation. ArXiv: 1702.03920, feb 2017.
- Hengst (2012) Hengst, Bernhard. Hierarchical Approaches. In Reinforcement Learning, pp. 293–323. 2012.
- Konidaris (2016) Konidaris, George. Constructing abstraction hierarchies using a skill-symbol loop. IJCAI International Joint Conference on Artificial Intelligence, 2016-Janua:1648–1654, 2016.
- Mannor et al. (2004) Mannor, Shie, Menache, Ishai, Hoze, Amit, and Klein, Uri. Dynamic abstraction in reinforcement learning via clustering. In Twenty-first international conference on Machine learning - ICML ’04, pp. 71, New York, New York, USA, 2004. ACM Press.
- Mehta et al. (2008) Mehta, Neville, Ray, Soumya, Tadepalli, Prasad, and Dietterich, Thomas. Automatic discovery and transfer of MAXQ hierarchies. In Proceedings of the 25th international conference on Machine learning - ICML ’08, pp. 648–655. ACM Press, 2008.
- Menache et al. (2002) Menache, Ishai, Mannor, Shie, and Shimkin, Nahum. Q-Cut—Dynamic Discovery of Sub-goals in Reinforcement Learning. In ECML 2002: Machine Learning: ECML 2002, pp. 295–306. 2002.
- Parr & Russell (1998) Parr, Ronald and Russell, Stuart. Reinforcement learning with hierarchies of machines. Neural Information Processing Systems (NIPS), pp. 1043–1049, 1998.
- Precup et al. (1998) Precup, Doina, Sutton, Rs, and Singh, S. Multi-time models for temporally abstract planning. Advances in Neural Information Processing Systems, 10(1995):1050–1056, 1998.
- Puterman (1994) Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994. ISBN 978-0-471-72782-8.
- Rasmussen et al. (2017) Rasmussen, Daniel, Voelker, Aaron, and Eliasmith, Chris. A neural model of hierarchical reinforcement learning. PLOS ONE, 12(7):e0180234, jul 2017.
- Tamar et al. (2016) Tamar, Aviv, Wu, Yi, Thomas, Garrett, Levine, Sergey, and Abbeel, Pieter. Value Iteration Networks. arXiv, pp. 1–14, feb 2016.
- Wiering & Schmidhuber (1997) Wiering, M. and Schmidhuber, J. HQ-Learning. Adaptive Behavior, 6(2):219–246, 1997.