Learning Functionally Decomposed Hierarchies for Continuous Control Tasks

02/14/2020, by Lukas Jendele et al.

Solving long-horizon sequential decision-making tasks in environments with sparse rewards is a longstanding problem in reinforcement learning (RL) research. Hierarchical Reinforcement Learning (HRL) has held the promise to enhance the capabilities of RL agents via operation on different levels of temporal abstraction. Despite the success of recent works in dealing with inherent nonstationarity and sample complexity, it remains difficult to generalize to unseen environments and to transfer different layers of the policy to other agents. In this paper, we propose a novel HRL architecture, Hierarchical Decompositional Reinforcement Learning (HiDe), which decomposes the hierarchical layers into independent subtasks while still allowing for joint training of all layers in an end-to-end manner. The main insight is to combine a control policy on a lower level with an image-based planning policy on a higher level. We evaluate our method on various complex continuous control tasks, demonstrating that generalization across environments and transfer of higher-level policies, such as from a simple ball to a complex humanoid, can be achieved. Videos are available at https://sites.google.com/view/hide-rl.

1 Introduction

Modern reinforcement learning (RL) can solve sequential decision-making (Mnih et al., 2013; Silver et al., 2017) and continuous control tasks in robotics (Lillicrap et al., 2015; Levine et al., 2015; Schulman et al., 2017). However, tasks that involve extended planning and sparse rewards still pose many challenges in successfully reasoning over long horizons and in generalizing from training to different test environments. Hierarchical reinforcement learning (HRL) splits the decision-making task into several subtasks (Sutton et al., 1999; Andre and Russell, 2002), which in practice are often learned separately via curriculum learning (Frans et al., 2017; Florensa et al., 2017; Bacon et al., 2016; Vezhnevets et al., 2017). Off-policy training and goal-conditioning have been shown to be effective means for joint learning of hierarchies (Levy et al., 2019; Nachum et al., 2018, 2019). However, these methods often struggle to generalize to unseen environments, as we show in Section 5.1. We argue that this is due to an inherent lack of separation of concerns in the design of the hierarchies.

Figure 1: Our functionally decomposed hierarchical architecture allows training higher-level Planning and Interface policies with simple agents and combining them with more complex Control policies. Example transfer of the planning and interface layer of a simple 2DoF ball agent trained on a locomotion task to a manipulation robot to solve more complex tasks such as pushing a cube around obstacles.

In this paper, we study how a more explicit hierarchical task decomposition into local control and more global planning tasks can alleviate both issues. In particular, we hypothesize that decoupling of the state-action spaces of different layers leads to a task decomposition that is beneficial for generalization across agents and environments without retraining.

More specifically, we introduce a 3-level hierarchy that decouples complex control tasks into global planning and local low-level control (see Figure 3, left). The functional decomposition into sub-tasks is explicitly enforced during training by restricting the type of information that is available to each layer. Global environment information is only available to the planning layer, whereas the full internal state of the agent is only accessible by the control layer. Furthermore, the actions of the top and middle layer are displacements in space in the global- (top-layer) and agent-centric frame (middle-layer). The lowest layer only outputs low-level control commands.

The benefits of this explicit task decomposition are manifold. First, individual layers have access only to task-relevant information, enabling policies to generalize to unseen test configurations where previous approaches do not. Second, the modularity allows for the composition of new agents without retraining. We demonstrate this by transferring the planning layer across different low-level agents, ranging from a simple 2DoF ball to a humanoid. The approach even allows generalization across domains, combining layers from navigation and robotic manipulation tasks to solve a compound navigation and manipulation task (see Figure 1).

In our framework (see Figure 2), the planner learns to find a trajectory from a top-down view. A value map of the environment is learned via a value propagation network (Nardelli et al., 2019). The planner's action is the position that maximizes the masked value map and is fed as a goal to the mid-level policy. This interface layer refines the goals into more reachable targets for the agent. The lowest layer has access to the proprioceptive state of the agent and learns a control policy to steer the agent to the intermediate goals. While the policies are functionally decoupled, they are trained together and must learn to cooperate.

In our experiments, we first show in a navigation environment that generalization poses challenges for state-of-the-art approaches. We then demonstrate that our method can generalize to randomly configured environment layouts. We also show the benefits of functional decomposition via transfer of individual layers between different agents and even domains. The results indicate that the proposed decomposition of policy layers is effective and can generalize to unseen environments. In summary, our main contributions include:

  • A novel multi-layer HRL architecture that allows functional decomposition and temporal abstraction for continuous control problems that require planning.

  • This architecture enables generalization beyond training conditions and environments.

  • Demonstration of transfer of individual layers across different agents and domains.

Figure 2: Our 3-layer HRL architecture. The planning layer receives a bird's-eye view of the environment and the agent's position and sets an intermediate target position. The interface layer splits this subgoal into reachable targets. A goal-conditioned control policy learns the required motor skills to reach the target given the agent's joint information.

2 Related Work

2.1 Hierarchical Reinforcement Learning

Learning hierarchical policies has seen lasting interest (Sutton et al., 1999; Schmidhuber, 1991; Dietterich, 1999; Parr and Russell, 1998; McGovern and Barto, 2001; Dayan and Hinton, 2000), but many approaches are limited to discrete domains or induce priors.

More recent works (Bacon et al., 2016; Vezhnevets et al., 2017; Tirumala et al., 2019; Nachum et al., 2018; Levy et al., 2019) have demonstrated HRL architectures in continuous domains. Vezhnevets et al. (2017) introduce FeUdal Networks (FUN), inspired by (Dayan and Hinton, 2000). In FUN, a hierarchical decomposition is achieved via a learned state representation in latent space, but the method is limited to discrete action spaces. Tirumala et al. (2019) introduce hierarchical structure into KL-divergence-regularized RL using latent variables, inducing semantically meaningful representations. The separation of concerns between high-level and low-level policy is guided by information asymmetry theory.

Nachum et al. (2018) present HIRO, an off-policy HRL method with two levels of hierarchy. The non-stationary signal of the upper policy is mitigated via off-policy corrections, while the lower control policy benefits from densely shaped rewards. Nachum et al. (2019) propose an extension of HIRO, which we call HIRO-LR, that learns a representation space from environment images, replacing the state and subgoal space with neural representations. Levy et al. (2019) introduce Hierarchical Actor-Critic (HAC), an approach that can jointly learn multiple policies in parallel. The policies are trained in sparse reward environments via different hindsight techniques.

HAC, HIRO and HIRO-LR consist of a set of nested policies where the goal of a policy is provided by the layer above. In this setting, the goal and state spaces of the lower policy are identical to the action space of the upper policy. This necessitates sharing of the state space across layers. Overcoming this limitation, we introduce a modular design to decouple the functionality of individual layers. This allows us to define different state, action and goal spaces for each layer. Global information about the environment is only available to the planning layer, while lower levels only receive information that is specific to the respective layer. Furthermore, HAC and HIRO have a state space that includes the agent's position and the goal position, while HIRO-LR and our method both have access to global information in the form of a top-down view image. Although the learned representation space of HIRO-LR can generalize to a flipped environment, it needs to be retrained, as do HIRO and HAC. Contrarily, HiDe generalizes without retraining.

2.2 Planning in Reinforcement Learning

In model-based reinforcement learning, much attention has been given to learning a dynamics model of the environment and subsequent planning (Sutton, 1990; Sutton et al., 2012; Wang et al., 2019). Eysenbach et al. (2019) propose a planning method that performs a graph search over the replay buffer. However, it requires spawning the agent at different locations in the environment and letting it learn a distance function in order to build the search graph. Unlike model-based RL, we do not learn state transitions explicitly. Instead, we learn a spatial value map from collected rewards.

Recently, differentiable planning modules that can be trained via model-free reinforcement learning have been proposed (Tamar et al., 2016; Oh et al., 2017; Nardelli et al., 2019; Srinivas et al., 2018). Tamar et al. (2016) establish a connection between convolutional neural networks and Value Iteration (Bertsekas, 2000). They propose Value Iteration Networks (VIN), an approach where model-free RL policies are additionally conditioned on a fully differentiable planning module. MVProp (Nardelli et al., 2019) extends this work by making it more parameter-efficient and generalizable. The planning layer in our approach is based on MVProp; however, contrary to prior work, we do not rely on a fixed neighborhood mask that sequentially provides actions in the agent's vicinity in order to reach a goal. Instead, we propose to learn an attention mask which is used to generate intermediate goals for the underlying layers.

Gupta et al. (2017) learn a map of indoor spaces and plan using a multi-scale VIN. Moreover, their robot operates on a discrete set of high-level macro actions. Nasiriany et al. (2019) use a goal-conditioned policy to learn a TDM-based planner on latent representations. Srinivas et al. (2018) propose Universal Planning Networks (UPN), which also learn how to plan an optimal action trajectory via a latent space representation. In contrast to our approach, the latter methods either rely on expert demonstrations or need to be retrained in order to achieve transfer to harder tasks.

3 Background

3.1 Goal-Conditioned Reinforcement Learning

We model a Markov Decision Process (MDP) augmented with a set of goals $\mathcal{G}$. We define the MDP as a tuple $(\mathcal{S}, \mathcal{A}, R, \gamma, P, \rho_0)$, where $\mathcal{S}$ and $\mathcal{A}$ are the sets of states and actions, respectively, $R$ a reward function, $\gamma \in [0, 1)$ a discount factor, $P(s_{t+1} \mid s_t, a_t)$ the transition dynamics of the environment and $\rho_0$ the initial state distribution, with $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$. Each episode is initialized with a goal $g \in \mathcal{G}$ and an initial state sampled from $\rho_0$. We aim to find a policy $\pi: \mathcal{S} \times \mathcal{G} \to \mathcal{A}$ which maximizes the expected return.

We train our policies using an actor-critic framework where the goal-augmented action-value function is defined as:

$Q^{\pi}(s_t, a_t, g) = \mathbb{E}\big[\textstyle\sum_{i=t}^{\infty} \gamma^{\,i-t} R(s_i, a_i, g) \mid s_t, a_t, g\big]$   (1)

The Q-function (critic) $Q_\theta$ and the policy (actor) $\pi_\phi$ are approximated by neural networks with parameters $\theta$ and $\phi$. The objective for $Q_\theta$ minimizes the loss:

$\mathcal{L}(\theta) = \mathbb{E}\big[\big(R(s_t, a_t, g) + \gamma\, Q_\theta(s_{t+1}, \pi_\phi(s_{t+1}, g), g) - Q_\theta(s_t, a_t, g)\big)^2\big]$   (2)

The policy parameters $\phi$ are trained to maximize the Q-value:

$\mathcal{J}(\phi) = \mathbb{E}\big[Q_\theta(s_t, \pi_\phi(s_t, g), g)\big]$   (3)

To address the issues with sparse rewards, we utilize Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), a technique that improves sample efficiency when training in goal-conditioned environments. The insight is that the desired goals of transitions stored in the replay buffer can be relabeled as states that were achieved in hindsight. Such data augmentation allows learning from failed episodes, which may generalize enough to solve the intended goal.
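As a concrete illustration, the relabeling step with the "future" strategy can be sketched as follows (a minimal Python sketch; the transition layout, field names and reward convention are assumptions for illustration, not the authors' implementation):

import random

def her_relabel(episode, reward_fn, k=2):
    # episode: list of dicts with keys 'state', 'action', 'next_state',
    #          'achieved_goal' and 'desired_goal'.
    # reward_fn(achieved, goal): 0 if the goal is considered reached, else -1.
    relabeled = []
    for t, tr in enumerate(episode):
        # Keep the original transition with its original sparse reward.
        relabeled.append(dict(tr, reward=reward_fn(tr['achieved_goal'],
                                                   tr['desired_goal'])))
        # 'future' strategy: sample goals achieved later in the same episode
        # and pretend they were the desired goal all along.
        for _ in range(k):
            future = random.choice(episode[t:])
            new_goal = future['achieved_goal']
            relabeled.append(dict(tr, desired_goal=new_goal,
                                  reward=reward_fn(tr['achieved_goal'], new_goal)))
    return relabeled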

3.2 Hindsight Techniques

In HAC, Levy et al. (2019) apply two hindsight techniques to address the challenges introduced by the non-stationary nature of hierarchical policies and by environments with sparse rewards. In order to train an upper-level policy, optimal behavior of the lower-level policy is simulated by hindsight action transitions: the subgoal action of the upper policy is replaced with the state that was actually achieved by the lower-level policy. Analogously to HER, hindsight goal transitions replace the subgoal with an achieved state, which consequently assigns a reward to the lower-level policy for achieving the virtual subgoal. Additionally, a third technique called subgoal testing is proposed. The incentive of subgoal testing is to help a higher-level policy understand the current capability of a lower-level policy and to learn Q-values for subgoal actions that are out of reach. We find both techniques effective and apply them to our model during training.
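The mechanics of hindsight action transitions and subgoal testing can be sketched as a relabeling of the upper layer's replay transitions (a hedged sketch with hypothetical field names; the penalty magnitude is left as a parameter rather than the value used in HAC):

def hindsight_action_transition(upper_transition, achieved_state):
    # Replace the subgoal proposed by the upper policy with the state the
    # lower policy actually reached, simulating an optimal lower level.
    tr = dict(upper_transition)
    tr['action'] = achieved_state
    return tr

def subgoal_testing_transition(upper_transition, subgoal_reached, penalty):
    # Occasionally the lower layer acts without exploration noise; if it still
    # fails to reach the subgoal, the upper layer is penalized so it learns
    # low Q-values for subgoals that are currently out of reach.
    tr = dict(upper_transition)
    if not subgoal_reached:
        tr['reward'] = penalty
        tr['discount'] = 0.0  # treat the penalized transition as terminal
    return tr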

3.3 Value Propagation Networks

Tamar et al. (2016) propose differentiable value iteration networks (VIN) for path planning and navigation problems. Nardelli et al. (2019) propose value propagation networks (MVProp) with better sample efficiency and generalization behavior. MVProp creates reward and propagation maps covering the environment. The reward map highlights the goal location and the propagation map determines the propagation factor of values through a particular location. The reward map $\bar{r}$ is an image of the same size as the environment image, where $\bar{r}_{ij} = 1$ if pixel $(i,j)$ overlaps with the goal position and $\bar{r}_{ij} = 0$ otherwise. Given the per-pixel propagation map $p$, the value map $V$ is calculated by unrolling max-pooling operations in a neighborhood $N(i,j)$ for $k$ steps as follows:

$V^{(0)}_{ij} = \bar{r}_{ij}, \qquad V^{(k)}_{ij} = \max\big(V^{(k-1)}_{ij},\ \max_{(i',j') \in N(i,j)} p_{ij}\, V^{(k-1)}_{i'j'}\big)$   (4)

The action (i.e., the target position) is selected to be the pixel maximizing the value in a predefined neighborhood $N(x_t)$ of the agent's current position $x_t$:

$a_t = \arg\max_{(i,j) \in N(x_t)} V_{ij}$   (5)

Note that the window $N(x_t)$ is determined by the discrete, pixel-wise actions.
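The propagation in Equation 4 can be written compactly with standard max-pooling operations; the following is a minimal PyTorch sketch (map sizes, the 3x3 neighborhood and the number of iterations are illustrative assumptions):

import torch
import torch.nn.functional as F

def mvprop_value_map(r, p, k=50):
    # r: [H, W] reward map, 1 at the goal pixel, 0 elsewhere.
    # p: [H, W] propagation map in [0, 1] (close to 0 for walls, close to 1 for free space).
    v = r.clone()
    for _ in range(k):
        # Maximum value flowing in from the 3x3 neighborhood of every pixel.
        neighbor_max = F.max_pool2d(v[None, None], kernel_size=3,
                                    stride=1, padding=1)[0, 0]
        # Keep the larger of the current value and the attenuated incoming value.
        v = torch.maximum(v, p * neighbor_max)
    return v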

4 Hierarchical Decompositional Reinforcement Learning

Figure 3: Planner layer. Given the top-down view environment image and the goal position on the map, the maximum value propagation network (MVProp) calculates a value map. Using the agent's current position, we estimate an attention mask that restricts the global value map to a local and reachable subgoal map. The policy selects the coordinates with the maximum value and assigns the next lower policy a subgoal that is relative to the agent's current position.

We introduce a novel hierarchical architecture, HiDe, allowing for an explicit functional decomposition across layers. Similar to HAC (Levy et al., 2019), our method achieves temporal abstraction via nested policies. Moreover, our architecture explicitly enforces functional decomposition. This is achieved by nesting i) an abstract planning layer, followed by ii) a local planner that iii) guides a control component. Crucially, only the top layer receives global information and is responsible for planning a trajectory towards a goal. The lowest layer learns a control policy for agent locomotion. The middle layer converts the planning layer's output into subgoals for the control layer. Achieving functional decoupling across layers crucially depends on reducing the state in each layer to the information that is relevant to its specific task. This design significantly improves generalization (see Section 5).
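To make the decomposition concrete, the following sketch shows how a single decision could flow through the three layers, each seeing only its own slice of the observation (all names, dimensions and the stand-in policies are illustrative, not the actual networks):

import numpy as np

def hide_step(obs, planner, interface, controller):
    # Each layer sees only task-relevant information:
    #   obs['image']   top-down view            -> planning layer only
    #   obs['goal']    final goal (pixels)      -> planning layer only
    #   obs['pos']     agent xy position        -> planning layer (as mask mean)
    #   obs['proprio'] joint angles/velocities  -> control layer only
    subgoal_global = planner(obs['image'], obs['goal'], obs['pos'])  # relative, global frame
    subgoal_local = interface(subgoal_global)                        # shorter-horizon target
    action = controller(obs['proprio'], subgoal_local)               # low-level actuation
    return action

# Toy usage with stand-in policies:
obs = {'image': np.zeros((32, 32)), 'goal': np.array([30.0, 30.0]),
       'pos': np.array([2.0, 2.0]), 'proprio': np.zeros(29)}
planner = lambda img, g, p: np.array([3.0, 0.0])
interface = lambda sg: sg / 3.0
controller = lambda proprio, sg: np.tanh(np.random.randn(8))
print(hide_step(obs, planner, interface, controller))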

4.1 Planning Layer

The highest layer of a hierarchical architecture is expected to learn high-level actions over a longer horizon, which define a coarse trajectory in navigation-based tasks. In the related work (Levy et al., 2019; Nachum et al., 2018, 2019), the planning layer, learning an implicit value function, shares the same architecture as lower layers. Since the task is learned for a specific environment, limits to generalization are inherent to this design choice. In contrast, we introduce a planning specific layer consisting of several components to learn the map and to find a feasible path to the goal.

The planning layer is illustrated in Figure 3. We utilize a value propagation network (MVProp) (Nardelli et al., 2019) to learn an explicit value map which projects the collected rewards onto the environment image. Given a top-down image of the environment, a convolutional network determines the per-pixel flow probability. For example, the probability value of a pixel corresponding to a wall should be 0, and that of a free passage should be 1.

Nardelli et al. (2019) use a predefined neighborhood of the agent's current position and pass the location of the maximum value in this neighborhood as goal position to the agent (Equation 5). We augment the MVProp network with an attention model which learns to define the neighborhood dynamically and adaptively. Given the value map $V$ and the agent's current position $x_t$, we estimate how far the agent can go, modeled by a 2D Gaussian. More specifically, we predict a full covariance matrix $\Sigma$ with the agent's global position $x_t$ as mean. We then build a 2D mask $M$ of the same size as the environment image by using the likelihood function:

$M_{ij} = \exp\big(-\tfrac{1}{2}\,(c_{ij} - x_t)^{T} \Sigma^{-1} (c_{ij} - x_t)\big)$   (6)

where $c_{ij}$ denotes the coordinates of pixel $(i,j)$.

Intuitively, the mask defines the density for the agent's success rate. Our planner policy selects an action (i.e., subgoal) that maximizes the masked value map as follows:

$g_t = \big(\arg\max_{(i,j)} M_{ij} V_{ij}\big) - x_t$   (7)

where $M_{ij} V_{ij}$ corresponds to the value at pixel $(i,j)$ on the masked value map. Note that the subgoal selected by the planning layer is relative to the agent's current position $x_t$, resulting in better generalization performance.

The benefits of having an attention model are twofold. First, the planning layer considers the agent dynamics in assigning subgoals, which may lead to finer- or coarser-grained subgoals depending on the underlying agent's performance. Second, the Gaussian window allows us to define a dynamic set of actions for the planner policy, which is essential to find a trajectory of subgoals on the map. While the action space includes all pixels of the value map, it is limited to the subset of reachable pixels by the Gaussian mask. We find that this leads to better obstacle avoidance behavior, for example around the corners and walls shown in Figure 8 in the Appendix.
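A sketch of the masking and subgoal selection (Equations 6-7) is given below, assuming the attention network has already produced the agent's pixel position as mean and a 2x2 covariance; names and tensor layouts are illustrative:

import torch

def gaussian_mask(h, w, mean, cov):
    # mean: [2] agent position in pixel coordinates (row, col).
    # cov:  [2, 2] symmetric positive-definite covariance from the attention model.
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing='ij')
    coords = torch.stack([ys, xs], dim=-1) - mean             # [H, W, 2]
    prec = torch.inverse(cov)                                  # [2, 2]
    maha = torch.einsum('hwi,ij,hwj->hw', coords, prec, coords)
    return torch.exp(-0.5 * maha)                              # [H, W] mask

def select_subgoal(value_map, mask, agent_pos):
    # Pick the pixel maximizing the masked value map and return it
    # relative to the agent's current position.
    masked = mask * value_map
    idx = torch.argmax(masked)
    row = torch.div(idx, masked.shape[1], rounding_mode='floor')
    col = idx % masked.shape[1]
    return torch.stack([row, col]).float() - agent_pos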

Since our planner layer operates in a discrete action space (i.e., pixels), the resolution of the projected maze image defines the minimum displacement of the agent's subgoals, affecting maneuverability. This could be tackled by using a soft-argmax (Chapelle and Wu, 2010) to select the subgoal pixel, allowing real-valued actions and providing invariance to the image resolution. In our experiments we see no difference in terms of the final performance. However, since the discrete setting allows for the use of DQN (Mnih et al., 2013) instead of DDPG (Silver et al., 2014), we prefer the discrete action space for simplicity and faster convergence.
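The soft-argmax alternative mentioned above amounts to a temperature-scaled softmax over the masked value map followed by an expectation over pixel coordinates; a minimal sketch (the temperature is an illustrative choice):

import torch

def soft_argmax_2d(masked_value_map, temperature=0.1):
    # Softmax-weighted expected (row, col) coordinate; differentiable and real-valued.
    h, w = masked_value_map.shape
    probs = torch.softmax(masked_value_map.reshape(-1) / temperature, dim=0)
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing='ij')
    coords = torch.stack([ys.reshape(-1), xs.reshape(-1)], dim=-1)  # [H*W, 2]
    return probs @ coords                                           # [2]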

Both the MVProp (Equation 4) and Gaussian likelihood (Equation 6) operations are differentiable. Hence, MVProp and the attention model parameters are trained by minimizing the standard mean squared Bellman error objective as defined in Equation 2.
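Since the masked value map plays the role of a Q-function over pixel actions, the planner's loss can be sketched as a standard one-step TD error (a hedged illustration; variable names and the discount are assumptions):

import torch

def planner_td_loss(masked_v, action_pixel, reward, next_masked_v, done, gamma=0.98):
    # masked_v, next_masked_v: [H, W] masked value maps for consecutive steps.
    # action_pixel: (row, col) of the subgoal chosen at the current step.
    q = masked_v[action_pixel]                     # Q-value of the executed subgoal
    with torch.no_grad():
        target = reward + gamma * (1.0 - done) * next_masked_v.max()
    return (q - target) ** 2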

4.2 Interface Layer

The middle layer in our hierarchy interfaces the high-level planning with low-level control by introducing an additional level of temporal abstraction. The planner's longer-term goals are further split into a number of shorter-term targets. Such a refinement policy provides the lower-level control layer with reachable targets, which in turn yields easier rewards and hence accelerates learning.

The interface layer policy is the only layer that does not directly interact with the environment. More specifically, the policy only receives the subgoal from the upper layer and chooses an action (i.e., a subgoal) for the lower-level locomotion layer. Note that the goal, state and action spaces of this policy are all in 2D space. Contrary to (Levy et al., 2019), we use subgoals that are relative to the agent's position, which helps to learn symmetrical gaits and to generalize.

4.3 Control Layer

The lowest layer learns a goal-conditioned control policy. Due to our explicit functional decomposition, it is the only layer with access to the agent’s internal state including joint positions and velocities. Meanwhile, the higher layers only have access to the agent’s position. In the control tasks considered, an agent has to learn a policy to reach a certain goal position, e.g., reach a target position in a navigation domain. Similar to HAC, we use hindsight goal transition techniques so that the control policy receives rewards even in failure cases.

All policies in our hierarchy are jointly-trained. We use the DDPG algorithm (Lillicrap et al., 2015) with the goal-augmented actor-critic framework (Equations 2-3) for the control and interface layers, and DQN (Mnih et al., 2013) for the planning layer (see Section 4.1).
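A condensed sketch of the goal-conditioned DDPG update (cf. Equations 2 and 3) as used for the interface and control layers is given below; the network sizes, learning rates and batch contents are illustrative stand-ins rather than the paper's configuration:

import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, goal_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + goal_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, s, g):
        return self.net(torch.cat([s, g], dim=-1))

class Critic(nn.Module):
    def __init__(self, obs_dim, goal_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + goal_dim + act_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, g, a):
        return self.net(torch.cat([s, g, a], dim=-1))

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.98):
    s, g, a, r, s2, done = batch  # tensors of shape [B, ...]; r and done are [B, 1]
    # Critic: minimize the goal-augmented Bellman error (Equation 2).
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_critic(s2, g, target_actor(s2, g))
    critic_loss = ((critic(s, g, a) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: maximize the Q-value of its own actions (Equation 3).
    actor_loss = -critic(s, g, actor(s, g)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Toy usage with random data (dimensions are illustrative):
obs_dim, goal_dim, act_dim, B = 8, 2, 2, 32
actor, critic = Actor(obs_dim, goal_dim, act_dim), Critic(obs_dim, goal_dim, act_dim)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
batch = (torch.randn(B, obs_dim), torch.randn(B, goal_dim), torch.randn(B, act_dim),
         -torch.ones(B, 1), torch.randn(B, obs_dim), torch.zeros(B, 1))
ddpg_update(actor, critic, target_actor, target_critic, actor_opt, critic_opt, batch)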

5 Experiments

We evaluate our method in a series of continuous control tasks (see https://sites.google.com/view/hide-rl). All environments are simulated in the MuJoCo physics engine (Todorov et al., 2012). Experiment and implementation details are provided in Appendix B. First, in Section 5.1, we compare to various baseline methods. In Section 5.2, we move to an environment with a more complex design in order to show our model's generalization capabilities. In Section 5.3, we provide an ablation study for our design choices. Finally, Section 5.4 demonstrates that our approach indeed leads to functional decomposition by transferring layers across agents and domains. We introduce the following task configurations (see Figure 5):

Maze Forward: The training environment in all experiments. The task is to reach a goal from a fixed start position.
Maze Backward: The training maze layout with swapped start and goal positions.
Maze Flipped: The mirrored training environment with swapped start and goal positions.
Maze Random: Randomly generated mazes with random start and goal positions, used in the complex maze (5.2), ablation (5.3) and transfer (5.4) experiments.

We always train in the Maze Forward environment. The reward signal during training is constantly -1, unless the agent reaches the given goal (except for HIRO and HIRO-LR, see Section 5.1). We test the agents on the above tasks with fixed start and goal positions. For more details about the environments, we refer to Appendix A. We intend to answer the following three questions: 1) Can our method generalize to unseen test environment layouts? 2) Is it possible to transfer the planning layers between agents? 3) Does task decomposition lead to generalization across domains?

Figure 4: Simple maze environments for the experiments reported in Section 5.1. The red sphere indicates the goal. Agents are trained only in the Forward environment and tested in the Backward and Flipped environments.

5.1 Simple Maze Navigation

We compare our method to state-of-the-art approaches including HIRO (Nachum et al., 2018), HIRO-LR (Nachum et al., 2019), HAC (Levy et al., 2019) and a modified version of HAC called RelHAC in the simple Maze Forward environment shown in Figure 4 (a). To ensure a fair comparison, we made a number of improvements to the HAC and HIRO implementations. For HAC, we introduced target networks and used the hindsight experience replay technique with the future strategy (Andrychowicz et al., 2017). We observed that oscillations close to the goal position kept HIRO agents from finishing the task successfully. We solved this issue by doubling the distance threshold for success. HIRO-LR is the closest related work to our method, since it also receives a top-down view image of the environment. Note that both HIRO and HIRO-LR have access to dense negative distance rewards, which is an advantage over HAC and HiDe, which only receive a reward when finishing the task.

The modified HAC implementation (RelHAC) uses the same lower and middle layers as HiDe, but we do not decouple the state-action spaces as is done in HiDe. Instead, RelHAC simply reuses the same structure for the middle and top layer. Preliminary experiments using fixed start and fixed goal positions during training for HAC, HIRO and HIRO-LR yielded zero success rates in all cases. Therefore, the baseline models are trained with fixed start and random goal positions, allowing them to receive a reward signal without having to reach the intended goal at the other end of the maze. Contrarily, HiDe is trained with fixed start and fixed goal positions, whereas HiDe-R denotes HiDe trained under the same conditions as the baseline methods. See Table 6 in the Appendix for an overview of all the baseline methods.

Methods Forward Backward Flipped
HAC
HIRO
HIRO-LR
RelHAC
HiDe-R
HiDe
Table 1: Success rates of achieving a goal with an Ant agent in a simple maze environment. All algorithms except for HiDe have been trained with randomly sampled goal positions. The results are averaged over 5 seeds.

All models successfully learned this task as shown in Table 1 (Forward column). HIRO demonstrates slightly better performance, which can be attributed to the fact that it is trained with dense rewards. RelHAC performs worse than HAC due to the pruned state space of each layer and due to the lack of an effective planner.

Table 1 also summarizes the models’ generalization abilities to the unseen Maze Backward and Maze Flipped environments (see Figure 5). While HIRO, HIRO-LR and HAC manage to solve the training environment (Maze Forward) with success rates between 99% and 82%, they suffer from overfitting to the training environment, indicated by the 0% success rates in the unseen test environments. Contrarily, our method is able to achieve 54% and 69% success rates in this generalization task. As expected, training our model with random goal positions (i.e., HiDe-R) yields a more robust model outperforming vanilla HiDe.

Figure 5: Complex navigation environments. The red sphere indicates the goal an agent needs to reach, with the starting point at the opposite end of the maze. The agent is trained in the Forward environment (a). To test generalization, we use the Backward environment with reversed start and goal positions (b), the Flipped environment, a mirrored maze with reversed start and goal positions (c), and Random layouts (d).

In subsequent experiments, we only report the results for our method, since our experiments have shown that the baseline methods cannot solve the training task for the more complex environments which we present next. We hypothesize that the exploration capabilities of these methods are not sufficient to learn the task.

5.2 Complex Maze Navigation

In this experiment, we train an ant and a ball agent (see Appendix A.1 for details) in the Maze Forward task with a more complex environment layout (cf. Figure 5), i.e., we increase the size of the environment and the number of obstacles, thereby also increasing the distance to the final reward. We keep both the start and goal positions fixed and evaluate this model on the different tasks (see Section 5).

Table 2 reports success rates of both agents in this complex task. Our model successfully learns the training task, showing that it is able to scale to larger environments with longer horizons. The performance in the unseen testing environments Maze Backward and Maze Flipped is similar to the results shown in Section 5.1 despite the increased difficulty. Since the randomly generated mazes are typically easier, our model shows similar or better performance.

5.3 Ablation studies

To support the claim that our architectural design choices lead to better generalization and functional decomposition, we compare empirical results of different variants of our method with the ant agent. First, we compare the performance of relative and absolute positions for the experiment reported in Section 5.2. To this end, we train HiDe-A and HiDe-AR, the variants of HiDe and HiDe-R that use absolute positions. Unlike the case of relative positions, the policy needs to learn all values within the range of the environment dimensions in this setting. Second, we compare HiDe against a variant of HiDe without the interface layer, called HiDe-NI.

The results of the ablation experiments for the complex maze task configurations are summarized in Table 3. Both HiDe-A and HiDe-AR fail to solve the complex maze tasks. These results indicate that relative positions improve performance and are an important aspect of our method for scaling to more complex environments and generalizing to other environment layouts. As seen in Table 3, the variant of HiDe without an interface layer (HiDe-NI) performs worse than HiDe (cf. Table 2) in all experiments. This indicates that the interface layer is an important part of our architecture.

We also run an ablation study for HiDe with a fixed window size. More specifically, we train and evaluate an ant agent with window sizes 3x3, 5x5, and 9x9. The results are included in Tables 13, 14, and 15 in the Appendix. The learned attention window (HiDe) achieves better or comparable performance. In all cases, HiDe generalizes better in the Maze Backward variant of the complex maze. Moreover, the learned attention eliminates the need for tuning the window size hyperparameter per agent and environment.

Agents Forward Random Backward Flipped
Ant
Ball
Table 2: Success rates in the complex maze. We train an ant and a ball on Forward maze and test on unseen Random, Backward, and Flipped environments.
Ant agent Forward Random Backward Flipped
HiDe-A
HiDe-AR
HiDe-NI
Table 3: Success rates of achieving a goal in the complex maze environment. HiDe-A and HiDe-AR are as in Table 5 in the Appendix. HiDe-NI is our method without the interface layer.

5.4 Transfer of Policies

We argue that a key to better generalization behavior in hierarchical RL lies in enforcing a separation of concerns across the different layers. To verify whether the overall task is really split into separate sub-tasks, we perform a sequence of experiments that transfer parts of the policy across agents and tasks.

5.4.1 Agent Transfer

We transfer individual layers across different agents to demonstrate that each part of the hierarchy indeed learns different sub-tasks. We first train an agent without our planning layer, i.e., with RelHAC. We then replace the top layer of this agent with the planning layer from the models trained in Section 5.2. Additionally, we train a humanoid agent and show as a proof of concept that transfer to a very complex agent can be achieved.

We carry out two sets of experiments. First, we transfer the ant model's planning layer to the simpler 2DoF ball agent. As indicated in Table 4, the performance of the ball with the ant's planning layer matches the results of the ball trained end-to-end with HiDe (cf. Table 2). The ball agent's success rate increases for the Maze Random and Maze Forward tasks, whereas it decreases slightly in the Maze Backward and Maze Flipped task configurations.

Second, we attach the ball agent's planning layer to the more complex ant agent. The newly composed agent performs marginally better or worse, depending on the task, in the Maze Flipped, Maze Random and Maze Backward configurations. Note that this experiment is an example of a case where the environment is first learned with a fast and easy-to-train agent (i.e., the ball) and then utilized by a more complex agent. We hereby show that transfer of layers between agents is possible and therefore find our hypothesis to be valid. Moreover, we estimate that training is roughly 3-4 times faster, since the complex agent does not have to learn the planning layer.

To further demonstrate our method's transfer capabilities, we train a humanoid agent (17 DoF) in an empty environment with shaped rewards. We then take the planning and interface layer from a ball agent and connect them as is with the locomotion layer of the trained humanoid (videos available at https://sites.google.com/view/hide-rl).

Transfer Forward Random Backward Flipped
Ant → Ball
Ball → Ant
Table 4: Success rates of achieving a goal in the complex maze environment with transferred layers. In each row, the planning layer is trained with a different agent on the Forward maze.

5.4.2 Domain Transfer

In this experiment, we demonstrate the capability of HiDe to transfer the planning layer from a simple ball agent, trained on a pure locomotion task, to a robot manipulation agent. The goal of this experiment is not to compete with state-of-the-art manipulation algorithms, but rather to highlight both our contributions, i.e., i) transfer of the modular layers across agents and domains and ii) generalization to different environment layouts without retraining.

To this end, we train a ball agent with HiDe as described in Section 5.1. Moreover, we train a control policy for a robot manipulation task in the OpenAI Gym "Push" environment (Brockman et al., 2016), which learns to push a cube-shaped object to a goal position specified relative to the robot. Note that the manipulation task does not include any obstacles during training. To obtain the final compound agent, we then attach the planning and interface layer of the ball agent to the goal-conditioned manipulation policy (cf. Figure 1). For testing, we generate 500 random environment layouts similar to the Random task described in Section 5. As in the navigation experiments in Section 5.1, state-of-the-art methods are able to solve these tasks when trained on a single environment layout. However, they do not generalize to other layouts without retraining. In contrast, our evaluation of the compound HiDe agent on unseen testing layouts shows a success rate of 49%. Note that the control layer has never been exposed to obstacles before. Thus, our modular approach can achieve domain transfer and generalize to different environments without retraining.

Figure 6: Example of three randomly configured test environments we use to demonstrate the domain transfer of the planning layer from a locomotion domain to a manipulation robot.

6 Conclusion

In this paper, we introduce a novel HRL architecture that can solve complex control tasks in 3D environments. The architecture consists of a planning layer which learns an explicit value map and is connected with a subgoal refinement layer and a low-level control layer. The framework can be trained end-to-end. Although trained with a fixed start and goal position, our method is able to generalize to previously unseen settings and environments. Furthermore, we demonstrate that transfer of planners between different agents can be achieved, enabling us to move a planner trained with a simplistic agent to a more complex agent, such as a humanoid or a robot manipulator. The key insight lies in the strict separation of concerns across layers, which is achieved via decoupled state-action spaces and restricted access to global information. In the future, we want to extend our method to a 3D planning layer connected with a 3D attention mechanism.

References

  • D. Andre and S. J. Russell (2002) State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, pp. 119–125. Cited by: §1.
  • M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba (2017) Hindsight Experience Replay. In Advances in Neural Information Processing Systems, pp. 5048–5058. Cited by: 11st item, §B.1, §3.1, §5.1.
  • P. Bacon, J. Harb, and D. Precup (2016) The option-critic architecture. abs/1609.05140. External Links: 1609.05140 Cited by: §1, §2.1.
  • D. P. Bertsekas (2000) Dynamic programming and optimal control. 2nd edition, Athena Scientific. External Links: ISBN 1886529094 Cited by: §2.2.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §A.1, §B.2, §5.4.2.
  • O. Chapelle and M. Wu (2010) Gradient descent optimization of smoothed information retrieval metrics. Information retrieval 13 (3), pp. 216–235. Cited by: §4.1.
  • P. Dayan and G. Hinton (2000) Feudal reinforcement learning. Advances in Neural Information Processing Systems 5, pp. . External Links: Document Cited by: §2.1, §2.1.
  • T. G. Dietterich (1999) Hierarchical reinforcement learning with the MAXQ value function decomposition. cs.LG/9905014. Cited by: §2.1.
  • Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking Deep Reinforcement Learning for Continuous Control. In International Conference on Machine Learning, pp. 1329–1338. Cited by: §A.1.
  • B. Eysenbach, R. Salakhutdinov, and S. Levine (2019) Search on the Replay Buffer: Bridging Planning and Reinforcement Learning. arXiv preprint arXiv:1906.05253. Cited by: §2.2.
  • C. Florensa, Y. Duan, and P. Abbeel (2017) Stochastic neural networks for hierarchical reinforcement learning. abs/1704.03012. External Links: 1704.03012 Cited by: §1.
  • K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman (2017) Meta learning shared hierarchies. abs/1710.09767. External Links: 1710.09767 Cited by: §1.
  • S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2017) Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616–2625. Cited by: §2.2.
  • S. Levine, C. Finn, T. Darrell, and P. Abbeel (2015) End-to-end training of deep visuomotor policies. abs/1504.00702. External Links: 1504.00702 Cited by: §1.
  • A. Levy, R. Platt, and K. Saenko (2019) Learning Multi-Level Hierarchies with Hindsight. In International Conference on Learning Representations, Cited by: §A.1, Appendix A, 4th item, 9th item, §B.3.2, §1, §2.1, §2.1, §3.2, §4.1, §4.2, §4, §5.1.
  • T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous Control with Deep Reinforcement Learning. arXiv preprint arXiv:1509.02971. Cited by: §1, §4.3.
  • A. McGovern and A. G. Barto (2001) Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, San Francisco, CA, USA, pp. 361–368. External Links: ISBN 1-55860-778-1 Cited by: §2.1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller (2013) Playing atari with deep reinforcement learning. abs/1312.5602. External Links: 1312.5602 Cited by: §1, §4.1, §4.3.
  • O. Nachum, S. Gu, H. Lee, and S. Levine (2019) Near-Optimal Representation Learning for Hierarchical Reinforcement Learning. In International Conference on Learning Representations, Cited by: §1, §2.1, §4.1, §5.1.
  • O. Nachum, S. S. Gu, H. Lee, and S. Levine (2018) Data-efficient Hierarchical Reinforcement Learning. In Advances in Neural Information Processing Systems, pp. 3303–3313. Cited by: §A.2.1, Appendix A, §1, §2.1, §2.1, §4.1.
  • N. Nardelli, G. Synnaeve, Z. Lin, P. Kohli, P. H. S. Torr, and N. Usunier (2019) Value Propagation Networks. In International Conference on Learning Representations, Cited by: §B.3.1, §1, §2.2, §3.3, §4.1, §4.1.
  • S. Nasiriany, V. Pong, S. Lin, and S. Levine (2019) Planning with Goal-Conditioned Policies. In Advances in Neural Information Processing Systems, pp. 14814–14825. Cited by: §2.2.
  • J. Oh, S. Singh, and H. Lee (2017) Value prediction network. abs/1707.03497. External Links: 1707.03497 Cited by: §2.2.
  • R. Parr and S. Russell (1998) Reinforcement learning with hierarchies of machines. In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, NIPS ’97, Cambridge, MA, USA, pp. 1043–1049. External Links: ISBN 0-262-10076-2 Cited by: §2.1.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. Cited by: Appendix B.
  • J. Schmidhuber (1991) Learning to generate subgoals for action sequences. IJCNN-91-Seattle International Joint Conference on Neural Networks ii, pp. 453 vol.2–. Cited by: §2.1.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §1.
  • D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic Policy Gradient Algorithms. In International Conference on Machine Learning, pp. 387–395. Cited by: §4.1.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis (2017) Mastering the game of go without human knowledge. Nature 550, pp. 354–. Cited by: §1.
  • A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn (2018) Universal planning networks. abs/1804.00645. External Links: 1804.00645 Cited by: §2.2, §2.2.
  • R. S. Sutton, D. Precup, and S. Singh (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artif. Intell. 112 (1-2), pp. 181–211. External Links: ISSN 0004-3702, Document Cited by: §1, §2.1.
  • R. S. Sutton, C. Szepesvári, A. Geramifard, and M. Bowling (2012) Dyna-style planning with linear function approximation and prioritized sweeping. abs/1206.3285. External Links: 1206.3285 Cited by: §2.2.
  • R. S. Sutton (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ML, Cited by: §2.2.
  • A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel (2016) Value Iteration Networks. In Advances in Neural Information Processing Systems, pp. 2154–2162. Cited by: §2.2, §3.3.
  • Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller (2018) DeepMind Control Suite. Technical report, abs/1801.00690. Cited by: §A.1.
  • D. Tirumala, H. Noh, A. Galashov, L. Hasenclever, A. Ahuja, G. Wayne, R. Pascanu, Y. W. Teh, and N. Heess (2019) Exploiting hierarchy for learning and transfer in kl-regularized RL. CoRR abs/1903.07438. External Links: Link, 1903.07438 Cited by: §2.1.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: Appendix A, §5.
  • A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu (2017) FeUdal networks for hierarchical reinforcement learning. abs/1703.01161. External Links: 1703.01161 Cited by: §1, §2.1.
  • T. Wang, X. Bao, I. Clavera, J. Hoang, Y. Wen, E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba (2019) Benchmarking model-based reinforcement learning. abs/1907.02057. External Links: 1907.02057 Cited by: §2.2.

Appendix A Environment details

We build on the MuJoCo (Todorov et al., 2012) environments used in (Nachum et al., 2018). Each episode in experiment 1 is terminated after 500 steps, and after 800 steps in the remaining experiments, or once the goal is reached. All rewards are sparse as in (Levy et al., 2019), i.e., 0 for reaching the goal and -1 otherwise. We consider the goal reached if the agent is within a fixed distance threshold of the goal. Since HIRO sets goals far away to encourage the lower layer to move to the goal faster, the agent oscillates around the target position; for HIRO we therefore double the distance threshold.

a.1 Agents

Our ant agent is equivalent to the one in (Levy et al., 2019), i.e., the ant from Rllab (Duan et al., 2016) with a gear power of 16 instead of 150 and a frame skip of 10 instead of 5. Our ball agent is the PointMass agent from the DM Control Suite (Tassa et al., 2018). We changed the joints so that the ball rolls instead of sliding. Furthermore, we resized the motor gear and the ball itself to match the maze size. For the manipulation robot, we slightly adapt the "Push" task from OpenAI Gym (Brockman et al., 2016). The original environment uses an inverse kinematics controller to steer the robot, where joint positions are enforced and realistic physics is ignored. This can cause unwanted behavior, such as penetration through objects. Hence, we change the control inputs to motor torques for the joints.

a.2 Environments

a.2.1 Locomotion mazes

All locomotion mazes are modelled with immovable blocks; (Nachum et al., 2018) use blocks of a different size. The environment layouts are depicted in Figure 5. For the randomly generated mazes, each block is sampled to be empty with a fixed probability, as illustrated in the sketch below. The start and goal positions are also sampled uniformly at random. Mazes where the start and goal positions are adjacent or where the goal is not reachable are discarded. For evaluation, we generated 500 such environments and reused them (one per episode) for all experiments.
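A minimal sketch of this sampling procedure, with a breadth-first search to discard unreachable goals (the maze dimensions and the block-empty probability are illustrative, since the exact values are not given above):

import random
from collections import deque

def reachable(grid, start, goal):
    # Breadth-first search over free cells (grid value 1 = free, 0 = wall).
    h, w = len(grid), len(grid[0])
    seen, queue = {start}, deque([start])
    while queue:
        i, j = queue.popleft()
        if (i, j) == goal:
            return True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w and grid[ni][nj] and (ni, nj) not in seen:
                seen.add((ni, nj))
                queue.append((ni, nj))
    return False

def sample_maze(h=10, w=10, p_empty=0.7, rng=random.Random(0)):
    # Resample until the start and goal are neither adjacent nor disconnected.
    while True:
        grid = [[1 if rng.random() < p_empty else 0 for _ in range(w)] for _ in range(h)]
        free = [(i, j) for i in range(h) for j in range(w) if grid[i][j]]
        if len(free) < 2:
            continue
        start, goal = rng.sample(free, 2)
        if abs(start[0] - goal[0]) + abs(start[1] - goal[1]) <= 1:
            continue
        if reachable(grid, start, goal):
            return grid, start, goal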

a.2.2 Manipulation environments

The manipulation environments differ from the locomotion mazes in scale; the wall size and block layout are adjusted accordingly. The object's position was used as the position input to the interface layer. When the object left the top-down view range, the episode was terminated. The last observation was only added to the control layer, with a subgoal penalty. The random layouts were generated using the same methodology as for the locomotion mazes.

Appendix B Implementation Details

Our PyTorch (Paszke et al., 2017) implementation will be available on the project website (https://sites.google.com/view/hide-rl).

b.1 Baseline experiments

For both HIRO and HAC we used the authors' original implementations (HIRO: https://github.com/tensorflow/models/tree/master/research/efficient-hrl, HAC: https://github.com/andrew-j-levy/Hierarchical-Actor-Critc-HAC-). We ran the hiro_xy variant, which uses only position coordinates for subgoals instead of all joint positions, to ensure a fair comparison with our method. To improve the performance of HAC in experiment one, we modified its Hindsight Experience Replay (Andrychowicz et al., 2017) implementation to use the future strategy. More importantly, we also added target networks to both the actor and the critic to improve performance. We used OpenAI Baselines for the DDPG+HER implementation. When pretraining for domain transfer, we made the achieved goals relative before feeding them into the network. For a better overview, see Table 6.

b.2 Evaluation details

For evaluation, we trained 5 seeds, each for 2.5M steps, on the Forward environment with continuous evaluation (every 100 episodes for 100 episodes). After training, we selected the best checkpoint of each seed based on the continuous evaluation. Then, we tested the learned policies for 500 episodes and reported the average success rate. Although the agent and goal positions are fixed, the initial joint positions and velocities are sampled from a uniform distribution, as is standard in OpenAI Gym environments (Brockman et al., 2016). Therefore, the tables in the results (cf. Section 5) contain means and standard deviations across the 5 seeds.

b.3 Network Structure

b.3.1 Planning layer

Input images for the planning layer were binarized as follows: each pixel corresponds to one block (0 if it is a wall, 1 if it is a corridor). In our planning layer, we process the input image of size 32x32 (20x20 for experiment 1) via two convolutional layers. Both layers have a single input and output channel and are padded so that the output size is the same as the input size. We propagate the value through the value map, as in (Nardelli et al., 2019), by repeatedly applying a max-pooling layer. Finally, the value map and an agent position image (a black image with a dot at the agent position) are processed by three convolutions interleaved with max-pooling, using zero padding. The result is flattened and processed by two fully connected layers, producing three outputs with different activation functions. The final covariance matrix is constructed from these outputs so that it is always symmetric and positive definite. For numerical reasons, we multiply by a binarized kernel mask instead of the actual Gaussian densities: values greater than the mean are set to 1 and the others to 0. In practice, we use the following line:

import torch as t

# Binarize the Gaussian kernel: pixels with a density above the per-image mean
# become 1, all others become 0.
kernel = t.where(kernel >= kernel.mean(dim=[1, 2], keepdim=True),
                 t.ones_like(kernel), t.zeros_like(kernel))

b.3.2 Middle and Locomotion layer

We use the same network architecture for the middle and lower layer as proposed by (Levy et al., 2019), i.e., three fully connected layers with non-linear activation functions. The output of the locomotion layer is activated with tanh and then scaled to the action range.

b.3.3 Training parameters

  • Discount for all agents.

  • Adam optimizer. Learning rate for all actors and critics.

  • Soft updates using moving average; for all controllers.

  • The replay buffer size was designed to store 500 episodes, similar to (Levy et al., 2019).

  • We performed updates on each layer after each epoch, once the replay buffer contained at least 256 transitions.

  • Batch size 1024.

  • Rewards 0 and -1 without any normalization.

  • Subgoal testing (Levy et al., 2019) only for the middle layer.

  • Observations also were not normalized.

  • 2 HER transitions per transition using the future strategy (Andrychowicz et al., 2017).

  • Exploration noise: 0.05, 0.01 and 0.1 for the planning, middle and locomotion layer respectively.
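For reference, the values stated explicitly above can be gathered into a single configuration; a sketch with placeholders (None) for quantities whose exact values are not given in the text:

TRAINING_CONFIG = {
    'optimizer': 'Adam',
    'discount': None,                     # value not stated explicitly above
    'actor_critic_lr': None,              # value not stated explicitly above
    'target_update': 'soft (moving average)',
    'replay_buffer_episodes': 500,
    'batch_size': 1024,
    'rewards': (0, -1),                   # sparse, no normalization
    'her_transitions_per_transition': 2,  # 'future' strategy
    'subgoal_testing_layers': ('middle',),
    'exploration_noise': {'planning': 0.05, 'middle': 0.01, 'locomotion': 0.1},
}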

b.4 Computational infrastructure

All HiDe, HAC and HIRO experiments were trained on 1 GPU (GTX 1080). The OpenAI DDPG+HER baselines were trained on 19 CPUs using the OpenAI Baselines repository.

Appendix C Additional results

In this section, we present all results collected for this paper including individual runs.

Figure 7: Success rates for the training task (Forward) with respect to the number of environment steps. All algorithms eventually learn the task. HIRO converges the fastest because it benefits from dense rewards. The results are averaged over 5 seeds.
Ant agent Forward Backward Flipped
HiDe-A
HiDe-AR
Table 5: Success rates in the simple maze (cf. Section 5.1). HiDe-A is our method with absolute subgoals. HiDe-AR has absolute goals and samples random goals during training. HiDe-A cannot solve the task, whereas HiDe-AR can learn the task, but does not generalize as well as our method.
Figure 8: A visual comparison of (left) our dynamic attention window with a (right) fixed neighborhood. The green dot corresponds to the selected subgoal in this case. Notice how our window is shaped so that it avoids the wall and induces a further subgoal.
Features HiDe HiDe-R HIRO HIRO-LR HAC DDPG+HER
Images x x x
Random start pos x x x x
Random end pos x
Agent position x
Shaped reward x x x x
EnvGeneral x x x x
Agent transfer x x x x
Table 6: Overview of related work and our method with their respective features. Features marked with a tick are included in the algorithm, whereas features marked with a cross are not used. The ticks/crosses are colored green if the use/lack of a feature is generally desired. Red color signifies that the use/lack of a feature is unfavorable. For example, randomized start positions are not used in HiDe during training, which is favorable.

Glossary:
Images: If the state space has access to images of the environment.
Random start pos: If the starting position is randomized during training.
Random end pos: If the goal position is randomized during training.
Agent position: If the state space has access to the agent’s position.
Shaped reward: If the algorithm learns using a shaped reward.
EnvGeneral: Whether generalization is possible without retraining.
Agent transfer: Whether transfer of layers between agents is possible without retraining.
Experiment Forward Backward Flipped
HAC 1 96.4 00.0 00.0
HAC 2 82.0 00.0 00.0
HAC 3 85.6 00.4 00.0
HAC 4 92.8 00.0 00.0
HAC 5 55.6 00.0 00.0
HIRO 1 100 00.0 00.0
HIRO 2 99.8 00.0 00.0
HIRO 3 99.0 00.0 00.0
HIRO 4 99.6 00.0 00.0
HIRO 5 97.2 00.0 00.0
HIRO-LR 1 88 00.0 00.0
HIRO-LR 2 100 00.0 00.0
HIRO-LR 3 97.6 00.0 00.0
HIRO-LR 4 98.8 00.0 00.0
HIRO-LR 5 100 00.0 00.0
RelHAC 1 77.6 18.8 00.0
RelHAC 2 86.4 00.0 00.0
RelHAC 3 77.8 00.0 00.0
RelHAC 4 19.6 00.0 00.0
RelHAC 5 71.0 00.0 00.0
HiDe-R 1 87.6 72.2 90.6
HiDe-R 2 87.8 70.4 89.6
HiDe-R 3 86.4 38.2 89.8
HiDe-R 4 90.6 69.4 94.0
HiDe-R 5 94.4 56.2 87.0
HiDe 1 80.6 21.2 00.0
HiDe 2 94.6 71.8 96.8
HiDe 3 81.6 53.4 90.2
HiDe 4 79.4 61.4 91.2
HiDe 5 87.0 66.0 66.6
Table 7: Results for experiment 1 for individual seeds.
Ant 1 Ant 2 Ant 3 Ant 4 Ant 5 Ball 1 Ball 2 Ball 3 Ball 4 Ball 5
Forward 82.0 80 92.2 70.6 82.0 99.6 99.0 100 83.2 100
Random 85.8 88.2 89.2 91.4 92.0 97.0 94.2 97.6 97.2 96.0
Backward 61.8 61.4 56.0 42.0 60.4 100 100 100 100 100
Flipped 87.4 68.2 64.8 64.0 85.4 100 96.4 100 100 100
Table 8: Results for experiment 2 for individual seeds.
AB 1 AB 2 AB 3 AB 4 AB 5 Averaged
Forward 100 100 100 100 100
Random 96.2 96.6 96.8 97.8 96.6
Backward 90.2 99.8 99.8 99.2 100
Flipped 100 100 100 100 100
Table 9: Results for ant to ball transfer for individual seeds.
BA 1 BA 2 BA 3 BA 4 BA 5 Averaged
Forward 45.8 81.2 75.8 64.8 64.4
Random 81.6 85 94.4 83.4 88.0
Backward 49.4 45.4 68.6 48.0 54.6
Flipped 32.0 83.8 79.6 26.8 70.4
Table 10: Results for ball to ant transfer for individual seeds.
Ant 1 Ant 2 Ant 3 Ant 4 Ant 5 Averaged
Forward 94.0 96.0 96.2 95.0 93.4
Backward 84.2 40.6 1.2 59.4 76.4
Flipped 2.0 1.8 75.4 90.4 0.0
Table 11: Results for experiment 1 on HiDe-AR.
Ant 1 Ant 2 Ant 3 Ant 4 Ant 5 Averaged
Forward 7.0 12.6 6.2 16.8 6.2
Random 29.0 40.2 37.6 67.2 57.8
Backward 0.0 0.2 0.0 0.8 0.0
Flipped 0.0 3.6 0.0 9.4 0.0
Table 12: Results for experiment 2 on HiDe without interface layer.
Ant 1 Ant 2 Ant 3 Ant 4 Ant 5 Averaged
Forward 38.6 49.8 42.2 75.6 33.8
Random 69.4 83.8 70.8 86.4 64.2
Backward 7.2 55.4 25.4 72.6 0.0
Flipped 0.0 69.8 0.0 0.0 0.0
Table 13: Results for experiment 2 with fixed 3x3 attention window.
Ant 1 Ant 2 Ant 3 Ant 4 Ant 5 Averaged
Forward 89.0 88.0 78.8 96.4 86.6
Random 87.8 93.0 89.2 92.0 87.0
Backward 58.2 73.6 45.2 0.0 3.2
Flipped 59.0 84.0 46.4 0.0 81.0
Table 14: Results for experiment 2 with fixed 5x5 attention window.
Ant 1 Ant 2 Ant 3 Ant 4 Ant 5 Averaged
Forward 92.0 75.4 80.2 91.0 94.6
Random 84.2 83.4 85.0 91.2 89.2
Backward 6.4 48.2 55.2 85.0 29.8
Flipped 85.2 64.2 81.6 93.6 71.8
Table 15: Results for experiment 2 with fixed 9x9 attention window.
Arm 1 Arm 2 Arm 3 Arm 4 Arm 5 Averaged
Random 50 48 47 48 51
Table 16: Results of the different seeds for the domain transfer experiments.
  Input:
  • Agent position , goal position , and projection from environment coordinates to image coordinates and its inverse ,.

  Parameters:
  1. maximum subgoal horizon , subgoal testing frequency

  Output:
  • trained actor and critic functions

{Train for M episodes}
  for  episodes do
      , {Get initial state and task goal}
     (, )
     Update all actor and critic networks
  end for
  function (
      {Run MVProp on top-down view image and goal position}
      {Predict mask parameters}
     
      {Mask the value map}
     return {Output relative subgoal corresponding to the max value pixel}
  end function
  function (
     return {Output fine-grained relative subgoal}
  end function
  function (
     return {Output actions for actuators}
  end function
  function train_level(
     , {Set current state and goal for level }
     for  attempts or until , achieved do
         + (if not subgoal testing) {Sample (noisy) action from policy}
        if  then
           Determine whether to test subgoal
            () {Train level using subgoal }
        else
           Execute primitive action and observe next state
        end if
        {Create replay transitions}
        if  and not reached then
           if  was subgoal tested then
               {Penalize subgoal }
           end if
            {Replace original action with action executed in hindsight}
        end if
        {Evaluate executed action on current goal and hindsight goals}
        
        
        
     end for
      Perform HER using transitions
     return {Output current state}
  end function
Algorithm 1 Hierarchical Decompositional Reinforcement Learning (HiDe)