Hierarchical reinforcement learning has been shown to perform remarkably well on many complex tasks gehring2021hierarchical; gurtler2021hierarchical; nachum2018data; levy2017learning; ghosh2019learning; eysenbach2019search. This is because control, or sequential decision making, in complex dynamical systems is often easier to synthesize when decomposed hierarchically nachum2019hierarchy. The high-level agent breaks the problem down into a series of sub-goals to be executed sequentially by the low-level agent.
However, the true potential of hierarchical learning methods has not been fully explored. Many current approaches in hierarchical reinforcement learning have been designed under the assumption of a deterministic environment. In the real world, this assumption does not hold, since events occur irregularly. In this paper, we propose a solution that is capable of operating in environments with random dynamics.
So far, a strong emphasis has been placed on solving problems arising from the complex dynamics of the interaction between agents at different levels levy2017learning; nachum2018data; gurtler2021hierarchical. However, the proposed solutions are not designed to operate in environments with random dynamics, in which some events occur in an irregular and therefore unpredictable manner. Many solutions assume that a high-level policy designates the sub-goal for a fixed period of time. During this time, the low-level agent is expected to complete the sub-goal. The duration of such a high-level action is either set by a human expert as a hyperparameter mcgovern2001automatic; vezhnevets2017feudal; levy2017learning; nachum2018data or is part of the output of the high-level agent gurtler2021hierarchical. This duration generally does not change in response to random events. If, during the pursuit of the sub-goal by the low-level agent, circumstances arise that make this plan obsolete, the hierarchy still has to wait until the end of the sub-goal. In such a situation, either inadequate behavior is exercised, or one hopes that the low-level agent will set a new plan for itself, thereby taking over the role of the higher-level agent.
gurtler2021hierarchical have already shown that current hierarchical reinforcement learning solutions are not well adapted to dynamic environments. The authors demonstrate that in environments requiring precise timing of the low- and high-level policies, it is important to adapt the duration of high-level actions. Therefore, in gurtler2021hierarchical, the high-level agent returns the sub-goal and the time in which it is to be achieved as parts of its action. This precise communication is effective in dynamic situations in which meticulously pursued sub-goals allow for better planning.
In this paper, we go one step further and consider the problem of dynamic environments in which the agent is unable to accurately predict the future. Consequently, it is essential that the high-level policy is able to react immediately to random situations and to replace the current sub-goals.
Suppose that the state coordinates that are not under the agent's direct control are subject to some random dynamics. In particular, we can imagine a situation where the agent is a ship and its aim is to cross a drawbridge, which opens in an irregular manner, e.g. due to varying traffic volumes. Since we are unable to predict such a situation, we need to react to it. Such unexpected events can be accounted for in the planning process in two ways. One can designate short-range sub-goals so that each is determined in relation to the most current state of the environment. However, such an approach could cause the hierarchy to degenerate into a one-level policy. Another way is to constantly monitor the dynamics of the environment during the sub-goal pursuit and, if circumstances so require, be ready to change the designated sub-goal. This approach seems to align with the human planning process, where, for example, a fixed route driven to work is modified when a road accident occurs.
In our proposed approach, the higher-level control constantly verifies whether the current sub-goal should be replaced by a different one, more appropriate for the current state. In the basic scenario, the high-level agent returns the sub-goal and the time it should take to complete it. However, instead of being active only every certain number of environment steps, the high-level agent receives an observation after each step of the bottom-level agent. Based on this observation, a decision is made whether to keep pursuing the previously set sub-goal or to change it to a new one.
The contribution of this paper can be summarized as follows:
We introduce a method, EAT, of monitoring and possibly terminating higher level actions in hierarchical RL. This method allows a hierarchical policy to immediately react to random events in the environment.
We design two strategies for monitoring and terminating the higher level actions.
We introduce a framework for hierarchical decomposition of Markov Decision Processes into subprocesses in which rewards for future events are discounted over time elapsing to their occurrence rather than over the number of actions to their occurrence.
2. Related work
Hierarchical Reinforcement Learning (RL) is based on decomposing long-horizon reinforcement learning tasks into smaller sub-tasks, which are chosen by the higher-level policy (parr1997reinforcement; pateriaHierarchicalReinforcementLearning2021). Sub-tasks are composed of atomic actions and may be learned with RL or hand-crafted. Sub-task learning, also called sub-task discovery, can be performed in parallel with learning the hierarchical policy or as an independent pretraining stage.
One way to introduce hierarchy into control is to decompose a given Markov Decision Process into a hierarchy of smaller MDPs. An action of a higher-level MDP selects an option, which is a lower-level MDP that is executed until it reaches its terminal state precup2000temporal; sutton1999between. In some approaches, options are predefined sutton1999between; barto2003recent; precup2000temporal; shankar2020learning, i.e., separately trained or specially handcrafted. An alternative approach is, for example, the Option-Critic method (baconOptionCriticArchitecture2016), in which options (policies and sub-task termination functions) are discovered from the beginning of the training of the entire hierarchical policy. One of the first papers in which option discovery was applied to continuous control is (bagaria2019option). However, an important limitation of that method is the requirement that the target task includes explicit goal states. From this family of methods, the closest approach to ours is the model proposed in li2018learning. There, the terminating function of an option is also trained, so that the set of terminal states changes with training time. The significant difference is that their method is based on low-level deterministic options representing the knowledge of a human expert.
Skill-based methods eysenbach2018diversity; sharma2019dynamics; campos2020explore are another similar approach. Skills are modeled by a low-level policy conditioned on an additional variable: different values of this variable trigger different behaviors. Usually, the low-level agent learns skills in an unsupervised way, using an intrinsic reward that maximizes the mutual information (MI) between a skill and the trajectories resulting from using that skill. zhang2021hierarchical introduced the HIDIO architecture, in which the low-level agent learns skills in this unsupervised way. The variable that triggers the appropriate skill is returned as part of a high-level action.
In another approach, the high-level policy output serves to communicate with a lower-level agent schmidhuber1991learning; mcgovern2001automatic; vezhnevets2017feudal. Usually, high-level information is attached to lower-level observations. In most cases, the high-level policy returns the sub-goal for the low-level policy. In this scenario, the low-level agent is rewarded for approaching the designated sub-goal. However, using previous experience to train a hierarchy of policies raises the problem of non-stationarity. Taken literally, this experience is no longer relevant: selecting the same targets for the lower-level policies now leads to different results, because these policies have changed due to learning. To address this issue, different methods of subgoal re-labeling were proposed. levy2017learning proposed hierarchical experience replay (HAC): the states actually achieved are used as if they had been the selected subgoals. On the other hand, nachum2018data presented an approach named HIRO, based on off-policy correction, where the unattained subgoal is re-labeled in transition data with another one, drawn from the distribution of subgoals that maximize the probability of the observed transitions.
However, few articles have been written that consider a scenario in which low-level policy does not work for a fixed number of steps. The selection of the appropriate frequency in decision-making by a high-level policy has a significant impact on the performance of the proposed methods. Too long or too short a period of time may degenerate the entire approach to a non-hierarchical method or disrupt the convergence of the algorithm. The possibility of dynamically changing the higher-level action in hierarchical reinforcement learning was explored in zhou2020. In the proposed TEMPLE algorithm, the temporal switch that decides if a new high-level action should be selected is part of the lower-level output.
The main difference between the above approach and ours lies in the level at which the decision to change the subgoal is made. In the TEMPLE method, the low-level policy returns a switch signal, which determines how much the new high-level action and how much the present one will be used in the next step.
The HiTS algorithm gurtler2021hierarchical represents the approach opposite to TEMPLE, and still different from ours. A high-level agent not only determines the target position but also the time in which it has to be achieved. Therefore, the high-level action is a pair comprising a subgoal and the number of low-level steps in which the subgoal should be achieved.
As part of our approach, not only a low-level agent is supposed to achieve the given subgoals in a specific time, but also a high-level agent has the ability to immediately change the implemented plan and interrupt the current high-level action. The method designed in this way is better suited to operating in dynamic or random environments. Thanks to the ability to change the target, the high-level agent does not have to accurately predict the future and is able to efficiently correct the plan when, for example, it is not implemented correctly due to insufficient convergence of the low-level agent.
3. Problem formulation
We consider the typical RL setup (2018sutton+1) based on a Markov Decision Process (MDP): an agent operates in its environment in discrete time $t$. At time $t$ it finds itself in a state, $s_t$, performs an action, $a_t$, receives a reward, $r_t$, and the state changes to $s_{t+1}$. When choosing actions, the agent anticipates future rewards with the discount factor $\gamma$.
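We assume that effective hierarchical control is possible in this MDP. Let there be $L$ levels of the hierarchy. Each $l$-th level defines an MDP with its state space $\mathcal{S}^l$, its action space $\mathcal{A}^l$, and rewards. For the highest level, $\mathcal{S}^L = \mathcal{S}$, and for the bottom level, $\mathcal{A}^1 = \mathcal{A}$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces of the original MDP. Taking an action at the $l$-th level, $a^l_t$, $l > 1$, launches an episode of the MDP at the $(l-1)$-st level. The action $a^l_t$ defines the goal in this lower-level episode, thus the states and rewards in this episode are co-defined by $a^l_t$. Once this episode is finished, another action at the $l$-th level is taken.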
Actions at the $l$-th level are defined by a policy,
$$a^l_t = \pi^l(s^l_t, \xi^l_t), \tag{1}$$
where $s^l_t$ is the state of the system perceived at the $l$-th level of the hierarchy at time $t$, and $\xi^l_t$ is a random element. For $l < L$, $s^l_t$ represents, among other things, the objective of the current operation at the $l$-th level, resulting from the current $(l+1)$-st level action. The random element enables exploration and learning of this policy.
The goal is to learn the hierarchy of policies (1) so that in each state at each hierarchy level the expected sum of future discounted rewards is maximized.
Robotic example with $L = 2$.
In robotic applications of RL, the state $s_t$ is usually a vector that comprises (i) positions of joints, (ii) velocities of the joints, (iii) readouts of sensors outside joints, (iv) a vector that defines the current tasks. Actions at the 2-nd level of the control hierarchy define target positions (and velocities) of the joints while the 1-st control level brings the joint positions (and velocities) to the given targets. Effective control of the robot with this hierarchy is possible because the robot is only able to accomplish given tasks by means of its body.
4.1. Hierarchy of MDPs
We assume that a decomposition of the original MDP into a hierarchy of MDPs satisfies the following conditions:
A higher-level action corresponds to an episode of the lower-level MDP. This episode may finish with a terminal state (e.g., when the lower level goal is achieved) or with a nonterminal state (e.g., when a predefined timeout has been reached).
Each hierarchy level $l$ has its own discount factor, $\gamma_l$, with $\gamma_L$ equal to $\gamma$ of the original MDP. The discount factor at the given level is related to how farsighted the policy at this level should be. Typically, the policy needs to be more farsighted at higher (in other words, tactical or strategic) levels, thus $\gamma_l \geq \gamma_{l-1}$.
At each hierarchy level $l$, a reward, $r^l_t$, is defined for each time instant of the original time, $t$, with the rewards for the highest level equal to the original MDP rewards, $r^L_t = r_t$. At each level, the sum of discounted future rewards
$$\sum_{i \geq 0} \gamma_l^{\,i} \, r^l_{t+i}$$
is maximized.
Note that the above assumptions are atypical. Usually gurtler2021hierarchical, it is assumed that each hierarchy level has its own time indexing determined by subsequent actions at this level. A single reward is then paid at each hierarchy level for a single action performed at this level, and further rewards are discounted at $\gamma_l$, $\gamma_l^2$, etc. However, when the policy at this level has any control over the duration of actions, it can manipulate this duration to maximize the sum of rewards, which usually contradicts the objectives of control at this level.
For instance, suppose that at level $l$ and time $t$, a catastrophic event is expected to happen in $n$ original time instants, covering $k$ actions at this hierarchy level. In our formulation, the negative reward for this event is always discounted at $\gamma_l^n$. When the typical formulation is adopted, that reward is discounted at $\gamma_l^k$. The policy may learn to minimize this weight just by choosing short-lasting actions, thereby maximizing $k$.
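The difference between discounting over elapsed time and discounting over the number of actions can be checked numerically. The following is a minimal sketch; the discount factor of 0.99, the 100 time steps to the event, and the function names are illustrative assumptions, not values or code from the paper.

```python
def weight_over_time(gamma: float, steps_to_event: int) -> float:
    """Weight of a future reward when discounting over original time steps."""
    return gamma ** steps_to_event


def weight_over_actions(gamma: float, actions_to_event: int) -> float:
    """Weight of a future reward when discounting over high-level actions."""
    return gamma ** actions_to_event


gamma = 0.99          # assumed discount factor
steps_to_event = 100  # assumed number of original time steps until the event

# Time-based discounting: the weight does not depend on how the 100 steps
# are split into high-level actions.
time_weight = weight_over_time(gamma, steps_to_event)

# Action-based discounting: splitting the same 100 steps into more (shorter)
# actions shrinks the weight of the penalty.
few_long_actions = weight_over_actions(gamma, 2)     # two actions of 50 steps
many_short_actions = weight_over_actions(gamma, 50)  # fifty actions of 2 steps

print(time_weight, few_long_actions, many_short_actions)
```

Under action-based discounting, the weight of the penalty depends on how the policy segments time, which is the incentive that time-based discounting removes.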
A hierarchical RL algorithm designed for the above typical formulation can usually be adjusted to our formulation. However, this may raise some technical issues, because event-specific discounting of further rewards needs to be introduced instead of just universal discounting.
4.2. Emergency higher-level action termination
The low-level actions work to realize plans imposed by the higher-level actions. However, sticking to these plans may become inefficient after unanticipated changes of the environment state. Then, the current higher-level plan needs to be terminated and a new one needs to be initiated. Following this rationale, we propose to proceed according to the following principles, applicable to all hierarchy levels above the bottom one:
A proposal action is designated at each time with the random element applied earlier to designate the currently realized action.
The proposal action and the currently realized one are compared.
If the proposal action appears to be better than, or much different from, the currently realized one (as explained below), then (1) the currently realized actions at this level and below are terminated, (2) the events with these actions, the rewards gathered, and the following states are applied to learning, and (3) new actions, with new random elements, are designated at this level and below.
Our proposed approach is based on running Algorithm 1 at each time $t$.
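The principles above can be sketched as a two-level control loop. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1. The scalar state, the `(subgoal, remaining budget)` form of the high-level action, and the names `run_episode`, `propose`, `should_terminate`, and `low_level_step` are all hypothetical.

```python
import random
from typing import Callable, Tuple

# Hypothetical high-level action: (subgoal, remaining time budget in steps).
HighLevelAction = Tuple[float, int]


def run_episode(
    propose: Callable[[float, float], HighLevelAction],
    should_terminate: Callable[[float, HighLevelAction, HighLevelAction], bool],
    low_level_step: Callable[[float, float], float],
    state: float,
    horizon: int,
    seed: int = 0,
) -> int:
    """Run one toy episode and return the number of emergency terminations."""
    rng = random.Random(seed)
    noise = rng.gauss(0.0, 1.0)  # random element of the current action
    subgoal, budget = propose(state, noise)
    terminations = 0
    for _ in range(horizon):
        # The proposal reuses the noise that produced the current action, so
        # any difference comes from the changed state, not from exploration.
        proposal = propose(state, noise)
        timed_out = budget <= 0
        emergency = not timed_out and should_terminate(
            state, (subgoal, budget), proposal
        )
        if timed_out or emergency:
            # A learning agent would store the finished high-level
            # transition in its replay buffer at this point.
            terminations += int(emergency)
            noise = rng.gauss(0.0, 1.0)  # fresh random element
            subgoal, budget = propose(state, noise)
        state = low_level_step(state, subgoal)
        budget -= 1
    return terminations
```

The high-level policy is queried at every environment step, but a new action is only committed when the time budget runs out or the termination condition fires.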
We propose two strategies to determine if the future rewards could increase thanks to the termination of the current action (the condition in Line 3 of the algorithm). Discussing these strategies, we will use the following notation. We assume that for each level an action-value function approximation of the current policy is available
An action currently realized at -th level is denoted by . It is defined by an action, , selected at time . Not necessarily . For instance, the action may have indicated a target point to be achieved in time-steps; then defines the same target to be achieved in time-steps. Following (1), we assume and . Let a currently proposed alternative to the on-going action be
In this strategy, we terminate the current action if it seems to be worse than the proposed alternative. This is found when
is greater than a certain threshold related to variability of the of values. We assume this threshold equal to , where is a parameter andfor different time when actions at the level were started.
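A value-based termination test of this kind can be sketched as follows. This is only an illustration under assumptions: the criterion is taken to be the gap between the action value of the proposal and that of the current action, and the threshold is taken to be a scale parameter times an exponentially smoothed standard deviation of the action values recorded when actions start. The class name, the update rule, and the default smoothing of 0.999 (chosen to echo the smoothing hyperparameter listed in the appendix) are assumptions, not the paper's definitions.

```python
class QGapTerminator:
    """Terminate when the proposal's Q-value exceeds the current one by a margin."""

    def __init__(self, threshold_scale: float, smoothing: float = 0.999):
        self.threshold_scale = threshold_scale  # assumed threshold parameter
        self.smoothing = smoothing              # assumed exponential smoothing
        self.mean = 0.0
        self.var = 0.0
        self.initialized = False

    def observe_start(self, q_value: float) -> None:
        """Update running statistics with the Q-value of a newly started action."""
        if not self.initialized:
            self.mean = q_value
            self.initialized = True
            return
        alpha = 1.0 - self.smoothing
        delta = q_value - self.mean
        # Standard exponentially weighted mean and variance updates.
        self.mean += alpha * delta
        self.var = self.smoothing * (self.var + alpha * delta * delta)

    def should_terminate(self, q_current: float, q_proposal: float) -> bool:
        """Return True if the proposal looks better by more than the threshold."""
        threshold = self.threshold_scale * self.var ** 0.5
        return q_proposal - q_current > threshold
```

Scaling the threshold by the spread of the value estimates keeps the test from firing on differences that are within the usual noise of the critic.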
4.2.2. Changing target
In this strategy, we terminate the current action if it significantly differs from the proposed one . To be able to determine how different the actions are from one another, we need further assumptions about the nature of the actions. Suppose there are two functions, and , that give the following meaning to actions: is a point to which the projection of the state is to converge in time . Actions are considered different when they assume a notably different change of the state projection in comparison to their average length. The current action is terminated when
for being a threshold parameter.
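A geometric test in this spirit can be sketched as below. The exact criterion is not reproduced here. As an assumption, the sketch compares the distance between the two target points with the average distance from the current state projection to each target, and it ignores the time component of the actions. The function name and argument names are hypothetical.

```python
import math
from typing import Sequence


def target_changed(
    projected_state: Sequence[float],
    current_target: Sequence[float],
    proposed_target: Sequence[float],
    threshold: float,
) -> bool:
    """Return True if the proposed target differs notably from the current one."""
    # How far apart the two targets are.
    shift = math.dist(current_target, proposed_target)
    # Average length of the state-projection change implied by the two actions.
    mean_length = 0.5 * (
        math.dist(projected_state, current_target)
        + math.dist(projected_state, proposed_target)
    )
    if mean_length == 0.0:
        return False  # both targets coincide with the current projection
    return shift / mean_length > threshold
```

Normalizing by the average length makes the test scale-free: a small shift of a distant target is not treated as a change of plan, while the same shift of a nearby target can be.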
4.3. EAT with HiTS
The main hierarchical RL algorithm to which we apply our proposed method of terminating actions is HiTS gurtler2021hierarchical. In this method, higher-level actions define goals for state projections along with the time to achieve these goals, as discussed in Section 4.2.1. At each hierarchy level, a plain RL algorithm such as SAC 2018haarnoja+3 is used for learning. The state-of-the-art efficiency of HiTS results primarily from the use of hindsight action relabeling: suppose that at a certain hierarchy level the target state projection is missed, and another final state projection is reached instead. Then, the whole episode of reaching this final projection is added to the experience for learning at this level as if this final projection was the actual target.
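Hindsight action relabeling as described above can be sketched with a small data structure. This is a simplified illustration, not the HiTS implementation. The `HighLevelTransition` fields, the function name, and the choice to keep the original transition alongside the relabeled copy are assumptions.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class HighLevelTransition:
    state: Tuple[float, ...]
    subgoal: Tuple[float, ...]  # target state projection chosen by the policy
    time_budget: int            # number of steps granted to reach the subgoal
    reward: float
    next_state: Tuple[float, ...]


def relabel_in_hindsight(
    transition: HighLevelTransition,
    achieved_projection: Tuple[float, ...],
    elapsed_steps: int,
) -> List[HighLevelTransition]:
    """Return the original transition and a copy relabeled with what was achieved."""
    relabeled = replace(
        transition,
        subgoal=achieved_projection,  # pretend the reached point was the target
        time_budget=elapsed_steps,    # and that the time used was the time given
    )
    return [transition, relabeled]
```

The relabeled copy gives the high-level learner an example of an action that the current low-level policy actually carries out.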
Introducing emergency action termination into HiTS requires the following changes:
Rewards at each level now need to be defined for each instant of the original MDP.
The discounting needs to be changed as discussed in Section 4.1. (That entails adjustments in the plain RL algorithms that operate at each level of the hierarchy.)
Algorithm 1 (the conditional action termination) needs to be run at each time-step $t$.
5. Experimental study
In this study, we evaluate experimentally the approach to hierarchical RL introduced in this paper. We compare four algorithms:
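The state-of-the-art HiTS algorithm gurtler2021hierarchical.
EAT(Q): HiTS with emergency action termination based on the comparison of action values (the first strategy of Section 4.2).
EAT(geom): HiTS with emergency action termination based on the changing-target strategy (Section 4.2.2).
HiTS+VariableDiscountSAC: as an ablation, we also analyze the behavior of HiTS with the modified discounting scheme presented in Section 4.1. This scheme is a prerequisite for EAT, but it can also operate on its own.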
We perform experiments on seven benchmark environments. Five of them are taken from gurtler2021hierarchical: Pendulum, Platforms, Drawbridge, UR5Reacher, and AntFourRooms. We also present results on two new environments that build on Drawbridge and Platforms but include stochastic elements. They are:
NoisyDrawbridge – similar to the Drawbridge environment, but with three possible drawbridge opening times: 400, 500, or 600. Thus, the drawbridge may open at a different time step in every episode, which prevents the agent from replaying the same sequence of actions in every episode.
NoisyPlatforms – built on Platforms environment, but with an element of uncertainty: One or both platforms can be immediately frozen with a chosen probability at every time step. We freeze the active platform with probability 0.005 at every time step. Moreover, we set the maximum number of freezes of this platform to two.
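The stochastic elements of the two environments can be sketched as simple sampling routines. This sketch is not the environments' code. The uniform choice among the three opening times, the function names, and the representation of freezes as a list of time steps are assumptions. The values 400, 500, 600, the probability 0.005, and the limit of two freezes come from the descriptions above.

```python
import random
from typing import List, Sequence


def sample_opening_time(
    rng: random.Random, candidates: Sequence[int] = (400, 500, 600)
) -> int:
    """Pick the drawbridge opening time for one episode (assumed uniform)."""
    return rng.choice(candidates)


def sample_freeze_steps(
    rng: random.Random,
    episode_length: int,
    freeze_prob: float = 0.005,
    max_freezes: int = 2,
) -> List[int]:
    """Return the time steps at which the active platform gets frozen."""
    steps: List[int] = []
    for t in range(episode_length):
        if len(steps) >= max_freezes:
            break  # the platform can be frozen at most max_freezes times
        if rng.random() < freeze_prob:
            steps.append(t)
    return steps


rng = random.Random(0)
print(sample_opening_time(rng), sample_freeze_steps(rng, episode_length=1000))
```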
Experimental settings are borrowed from gurtler2021hierarchical. For the EAT algorithm, we use the threshold parameter (Section 4.2) for EAT(Q) and for EAT(geom) in all the experiments.
The results of the qualitative comparison of HiTS, EAT(Q), EAT(geom), and HiTS+VariableDiscountSAC are presented in Figure 2. We find that terminating high-level actions is beneficial even in a deterministic environment such as Drawbridge, and makes no difference in terms of performance in the others. The EAT(Q) and EAT(geom) interruption mechanisms allow the agent to excel in this environment and achieve a nearly perfect success rate. Meanwhile, the introduction of VariableDiscountSAC merely shortens the learning time in Drawbridge and changes little in the other environments.
Qualitatively, we find that both EAT versions perform considerably better in the NoisyDrawbridge and NoisyPlatforms environments. HiTS cannot deal well with sudden changes in environment elements, because the goals set for the lower-level policies cannot be changed. As expected, the EAT variants learn to interrupt invalid subgoals soon after the first indication that the current subgoal has become obsolete. This phenomenon is studied in depth in Section 5.2.
5.2. Action termination analysis
We analyzed the behavior of the models trained in the experiments described in Section 5.1 to verify whether the emergency action termination mechanism works as intended by responding to random changes in an environment. We used 5 models trained using the EAT(Q) and EAT(geom) methods on the NoisyDrawbridge and NoisyPlatforms environments. Over the course of 500 episodes for each model, we measured the delay between the occurrence of a random event and the following interruption. The distribution of these delays is presented in Figure 4.
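The delay measurement can be sketched as follows. This is a hypothetical helper, not the analysis code used in the study. It assumes that the time steps of random events and of interruptions are logged per episode and that an interruption occurring at the same step as the event counts with zero delay.

```python
from typing import List, Sequence


def interruption_delays(
    event_steps: Sequence[int], interruption_steps: Sequence[int]
) -> List[int]:
    """For each random event, return the delay to the first following interruption."""
    interruptions = sorted(interruption_steps)
    delays: List[int] = []
    for event in event_steps:
        following = [t for t in interruptions if t >= event]
        if following:  # events with no later interruption are skipped
            delays.append(following[0] - event)
    return delays


# Example log of one episode (made-up numbers for illustration only).
print(interruption_delays([100, 400], [50, 163, 420, 500]))
```

Collecting these delays over all episodes and models yields the data for a histogram of reaction times.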
It can be seen that a large number of interruptions occur in the time steps following the unexpected change in the environment. This is best seen in the NoisyDrawbridge environment, where many actions are changed within the first 100 steps after the drawbridge starts to open. It should be noted that the interruptions in this environment do not happen immediately after the random event, as it takes precisely 63 steps until the bridge allows the ship to sail through. It should also be noted that the interruption times relative to the environment change are similar for models trained using both EAT variants. Apparently, this behaviour is the cause of the superior performance of the EAT methods on the NoisyDrawbridge environment, as seen in Fig. 3.
The randomness of the NoisyPlatforms environment has no such strict relation with changing the possible options for the agent as in the NoisyDrawbridge environment. This is most likely caused by the fact that freezing a platform alone is not necessarily a reason to change the high-level action immediately. However, it changes the trajectory of events in such a way that a better action may be selected after some time. This is reflected by a much smoother distribution of interruptions following the random events in this environment.
Our experiments (HiTS+VariableDiscountSAC) show that only changing the reward and discounting pattern from that over high-level actions to that over low-level time does not affect the performance consistently. However, it is an essential prerequisite for terminating higher-level actions.
In four of the seven analyzed environments (Drawbridge, AntFourRooms, NoisyDrawbridge, NoisyPlatforms), EAT achieved a higher success rate than HiTS. In one case (UR5Reacher), both algorithms reached the same success rate. Note that the original environments are deterministic: nothing unexpected happens there, and the only reason to terminate an action is its unexpectedly low quality. However, in the environments with more randomness, NoisyDrawbridge and NoisyPlatforms, there are objective reasons to terminate actions, and EAT consequently produces significantly higher ultimate success rates than HiTS.
In this paper, we introduced a framework for dividing a Markov Decision Process into subprocesses in which rewarding and discounting are over the original MDP time. In this framework, we introduced a method of terminating higher-level actions when they become obsolete due to random events in the environment. Our proposed method enables an immediate response of control to these events, thereby increasing control quality. The experiments confirm that the quality increases indeed, especially in non-deterministic environments.
Appendix A Algorithms’ hyperparameters
Hyperparameters of HiTS have been taken from gurtler2021hierarchical. Common parameters for all algorithms are presented in Tab. 1 for the Platforms and Drawbridge environments and in Tab. 2 for the Pendulum, UR5Reacher and AntFourRooms environments. Hyperparameters specific to the HiTS algorithm with the original SAC and with VariableDiscountSAC are presented in Tab. 3 and Tab. 4, respectively. Hyperparameters of EAT(Q) and EAT(geom) are presented in Tab. 5 and Tab. 6, respectively. The noisy environments introduced in Section 5 share hyperparameters with their original versions.
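Table 1. Common hyperparameters for the Platforms and Drawbridge environments.
|Parameter||Platforms (+NoisyPlatforms)||Drawbridge (+NoisyDrawbridge)|
|Random action fraction||0.05||0.05|
|Grad steps/env steps||0.5||1|
|Max actions in episode||10||5|
|Random action fraction||0.05||0.05|
|Grad steps/env steps||0.5||1|
|SAC target entropy||-13.223820062136808|

Table 2. Common hyperparameters for the Pendulum, UR5Reacher and AntFourRooms environments.
|Random action fraction||0.05||0.05||0.05|
|Grad steps/env steps||1||0.07||0.07|
|Max actions in episode||22||24||22|
|SAC target entropy||-2.60162130869488|
|Random action fraction||0.05||0.05||0.05|
|Grad steps/env steps||1||0.07||0.07|

Table 3. Hyperparameters specific to HiTS with the original SAC.
|HL flat algorithm||SAC|

Table 4. Hyperparameters specific to HiTS with VariableDiscountSAC.
|HL flat algorithm||VarDiscountSAC|

Table 5. Hyperparameters of EAT(Q).
|HL flat algorithm||VarDiscountSAC|
|Exponential average of Q smoothing||0.999|

Table 6. Hyperparameters of EAT(geom).
|HL flat algorithm||VarDiscountSAC|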