Modern reinforcement learning (RL) can solve sequential decision-making tasks (Mnih et al., 2013; Silver et al., 2017) and continuous control tasks in robotics (Lillicrap et al., 2015; Levine et al., 2015; Schulman et al., 2017). However, tasks that involve extended planning and sparse rewards still pose challenges, both in reasoning successfully over long horizons and in generalizing from training to different test environments. Hierarchical reinforcement learning (HRL) splits the decision-making task into several subtasks (Sutton et al., 1999; Andre and Russell, 2002), which in practice are often learned separately via curriculum learning (Frans et al., 2017; Florensa et al., 2017; Bacon et al., 2016; Vezhnevets et al., 2017). Off-policy training and goal-conditioning have been shown to be effective means for joint learning of hierarchies (Levy et al., 2019; Nachum et al., 2018, 2019). However, these methods often struggle to generalize to unseen environments, as we show in Section 5.1. We argue that this is due to an inherent lack of separation of concerns in the design of the hierarchies.
In this paper, we study how a more explicit hierarchical task decomposition into local control and more global planning tasks can alleviate both issues. In particular, we hypothesize that decoupling of the state-action spaces of different layers leads to a task decomposition that is beneficial for generalization across agents and environments without retraining.
More specifically, we introduce a 3-level hierarchy that decouples complex control tasks into global planning and local low-level control (see Figure 3, left). The functional decomposition into sub-tasks is explicitly enforced during training by restricting the type of information that is available to each layer. Global environment information is only available to the planning layer, whereas the full internal state of the agent is only accessible by the control layer. Furthermore, the actions of the top and middle layer are displacements in space in the global- (top-layer) and agent-centric frame (middle-layer). The lowest layer only outputs low-level control commands.
The benefits of this explicit task decomposition are manifold. First, individual layers have access only to task-relevant information, enabling policies to generalize to unseen test configurations where previous approaches do not. Second, the modularity allows for the composition of new agents without retraining. We demonstrate this by transferring the planning layer across different low-level agents, ranging from a simple 2DoF ball to a humanoid. The approach even allows generalization across domains, combining layers from navigation and robotic manipulation tasks to solve a compound navigation and manipulation task (see Figure 1).
In our framework (see Figure 2), the planner learns to find a trajectory from a top-down view. A value map of the environment is learned via a value propagation network (Nardelli et al., 2019). The planner's action is the position that maximizes the masked value map and is fed as a goal to the mid-level policy. This interface layer refines the goal into more reachable targets for the agent. The lowest layer has access to the proprioceptive state of the agent and learns a control policy to steer the agent to the intermediate goals. While the policies are functionally decoupled, they are trained together and must learn to cooperate.
In our experiments, we first show in a navigation environment that generalization causes challenges for state-of-the-art approaches. We then demonstrate that our method can generalize to randomly configured environment layouts. We also show the benefits of functional decomposition via transfer of individual layers between different agents and even domains. The results indicate that the proposed decomposition of policy layers is effective and can generalize to unseen environments. In summary, our main contributions include:
A novel multi-layer HRL architecture that allows functional decomposition and temporal abstraction for continuous control problems that require planning.
This architecture enables generalization beyond training conditions and environments.
Demonstration of transfer of individual layers across different agents and domains.
2 Related Work
2.1 Hierarchical Reinforcement Learning
Learning hierarchical policies has seen lasting interest (Sutton et al., 1999; Schmidhuber, 1991; Dietterich, 1999; Parr and Russell, 1998; McGovern and Barto, 2001; Dayan and Hinton, 2000), but many approaches are limited to discrete domains or induce priors.
More recent works (Bacon et al., 2016; Vezhnevets et al., 2017; Tirumala et al., 2019; Nachum et al., 2018; Levy et al., 2019) have demonstrated HRL architectures in continuous domains. Vezhnevets et al. (2017) introduce FeUdal Networks (FuN), inspired by Dayan and Hinton (2000). In FuN, a hierarchical decomposition is achieved via a learned state representation in latent space, but the approach is limited to discrete action spaces. Tirumala et al. (2019) introduce hierarchical structure into KL-divergence-regularized RL using latent variables, inducing semantically meaningful representations. The separation of concerns between the high-level and low-level policy is guided by the principle of information asymmetry.
Nachum et al. (2018) present HIRO, an off-policy HRL method with two levels of hierarchy. The non-stationary signal of the upper policy is mitigated via off-policy corrections, while the lower control policy benefits from densely shaped rewards. Nachum et al. (2019) propose an extension of HIRO, which we call HIRO-LR, that learns a representation space from environment images, replacing the state and subgoal space with neural representations. Levy et al. (2019) introduce Hierarchical Actor-Critic (HAC), an approach that can jointly learn multiple policies in parallel. The policies are trained in sparse reward environments via different hindsight techniques.
HAC, HIRO and HIRO-LR consist of a set of nested policies where the goal of a policy is provided by the layer above. In this setting, the goal and state space of the lower policy are identical to the action space of the upper policy. This necessitates sharing of the state space across layers. Overcoming this limitation, we introduce a modular design to decouple the functionality of individual layers. This allows us to define different state, action and goal spaces for each layer. Global information about the environment is only available to the planning layer, while lower levels only receive information that is specific to the respective layer. Furthermore, HAC and HIRO have a state space that includes the agent’s position and the goal position, while HIRO-LR and our method both have access to global information in the form of a top-down view image. Although the learned space representation of HIRO-LR can generalize to a flipped environment, it needs to be retrained, as do HIRO and HAC. In contrast, HiDe generalizes without retraining.
2.2 Planning in Reinforcement Learning
In model-based reinforcement learning, much attention has been given to learning a dynamics model of the environment and subsequent planning (Sutton, 1990; Sutton et al., 2012; Wang et al., 2019). Eysenbach et al. (2019) propose a planning method that performs a graph search over the replay buffer. However, they require spawning the agent at different locations in the environment and letting it learn a distance function in order to build the search graph. Unlike model-based RL, we do not learn state transitions explicitly. Instead, we learn a spatial value map from collected rewards.
Recently, differentiable planning modules that can be trained via model-free reinforcement learning have been proposed (Tamar et al., 2016; Oh et al., 2017; Nardelli et al., 2019; Srinivas et al., 2018). Tamar et al. (2016) establish a connection between convolutional neural networks and value iteration (Bertsekas, 2000). They propose Value Iteration Networks (VIN), an approach in which model-free RL policies are additionally conditioned on a fully differentiable planning module. MVProp (Nardelli et al., 2019) extends this work by making it more parameter-efficient and generalizable. The planning layer in our approach is based on MVProp; however, contrary to prior work, we do not rely on a fixed neighborhood mask that sequentially provides actions in the agent's vicinity in order to reach a goal. Instead, we propose to learn an attention mask which is used to generate intermediate goals for the underlying layers.
Gupta et al. (2017) learn a map of indoor spaces and plan using a multi-scale VIN. Moreover, their robot operates on a discrete set of high-level macro actions. Nasiriany et al. (2019) use a goal-conditioned policy to learn a TDM-based planner on latent representations. Srinivas et al. (2018) propose Universal Planning Networks (UPN), which also learn to plan an optimal action trajectory via a latent space representation. In contrast to our approach, the latter methods either rely on expert demonstrations or need to be retrained in order to transfer to harder tasks.
3.1 Goal-Conditioned Reinforcement Learning
We model a Markov Decision Process (MDP) augmented with a set of goals G. We define the MDP as a tuple (S, A, r, γ, T, ρ₀), where S and A are the sets of states and actions, respectively, r: S × A × G → R a reward function, γ ∈ [0, 1) a discount factor, T(s_{t+1} | s_t, a_t) the transition dynamics of the environment and ρ₀ the initial state distribution, with s ∈ S and a ∈ A. Each episode is initialized with a goal g ∈ G and an initial state s₀ sampled from ρ₀. We aim to find a policy π: S × G → A which maximizes the expected return.
We train our policies using an actor-critic framework, where the goal-augmented action-value function is defined as:

Q^π(s_t, a_t, g) = E[ Σ_{k=t}^{∞} γ^{k−t} r(s_k, a_k, g) | s_t, a_t, g ].

The Q-function (critic) Q_θ and the policy (actor) π_φ are approximated by neural networks with parameters θ and φ. The objective for the critic minimizes the loss:

L(θ) = E[ ( r_t + γ Q_θ(s_{t+1}, π_φ(s_{t+1}, g), g) − Q_θ(s_t, a_t, g) )² ].

The policy parameters φ are trained to maximize the Q-value:

J(φ) = E[ Q_θ(s_t, π_φ(s_t, g), g) ].
To address the issues with sparse rewards, we utilize Hindsight Experience Replay (HER) (Andrychowicz et al., 2017), a technique to improve sample-efficiency in training goal-conditioned environments. The insight is that the desired goals of transitions stored in the replay buffer can be relabeled as states that were achieved in hindsight. Such data augmentation allows learning from failed episodes, which may generalize enough to solve the intended goal.
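As a minimal sketch of HER's "future" relabeling strategy in pure Python (the transition layout and function name are illustrative assumptions, not HER's actual API):

```python
import random

def her_relabel(episode, k=4):
    # Hindsight relabeling with the "future" strategy: for each transition,
    # sample up to k goals from states achieved later in the episode and
    # pretend they were the intended goal all along.
    # `episode` is a list of (state, action, achieved_goal, next_state)
    # tuples -- an illustrative layout, not HER's actual interface.
    relabeled = []
    for t, (s, a, achieved, s_next) in enumerate(episode):
        future = episode[t:]
        for _ in range(min(k, len(future))):
            g = random.choice(future)[2]  # an achieved goal from the future
            reward = 0.0 if achieved == g else -1.0  # sparse-reward convention
            relabeled.append((s, a, g, reward, s_next))
    return relabeled
```

Because a later achieved state is relabeled as the goal, a failed episode still produces transitions with non-negative reward, giving the learner a training signal.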
3.2 Hindsight Techniques
In HAC, Levy et al. (2019) apply two hindsight techniques to address the challenges introduced by the non-stationary nature of hierarchical policies and environments with sparse rewards. In order to train an upper-level policy, optimal behavior of the lower-level policy is simulated by hindsight action transitions. More specifically, the subgoal action of the upper policy is replaced with a state that was actually achieved by the lower-level policy. Analogously to HER, hindsight goal transitions replace the subgoal with an achieved state, which consequently assigns a reward to the lower-level policy for achieving the virtual subgoal. Additionally, a third technique called subgoal testing is proposed. The incentive of subgoal testing is to help a higher-level policy understand the current capability of a lower-level policy and to learn Q-values for subgoal actions that are out of reach. We find these techniques effective and apply them to our model during training.
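The relabeling operations described above can be sketched as follows (the dict-based transition layout and the penalty convention are illustrative assumptions, not HAC's exact implementation):

```python
def hindsight_action_transition(transition, achieved):
    # Replace the upper layer's subgoal action with the state the lower
    # layer actually reached, simulating an optimal lower-level policy.
    t = dict(transition)
    t['action'] = achieved
    return t

def hindsight_goal_transition(transition, achieved):
    # Relabel the subgoal with the achieved state so the lower-level
    # policy is rewarded for the virtual subgoal it did reach.
    t = dict(transition)
    t['goal'] = achieved
    t['reward'] = 0.0  # sparse-reward convention: success = 0, else -1
    return t

def subgoal_test_transition(transition, reached, penalty):
    # Subgoal testing: if the lower layer fails to reach the proposed
    # subgoal in a noise-free test rollout, penalize the upper layer so
    # it learns not to propose unreachable subgoals.
    t = dict(transition)
    if not reached:
        t['reward'] = -penalty
    return t
```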
3.3 Value Propagation Networks
Tamar et al. (2016) propose differentiable Value Iteration Networks (VIN) for path planning and navigation problems. Nardelli et al. (2019) propose Value Propagation Networks (MVProp) with better sample efficiency and generalization behavior. MVProp creates reward and propagation maps covering the environment. The reward map highlights the goal location and the propagation map determines the propagation factor of values through a particular location. The reward map R is an image of the same size as the environment image, with entries r_{ij} = 1 if pixel (i, j) overlaps with the goal position and r_{ij} = 0 otherwise. The value map V is calculated by unrolling max-pooling operations in a 3×3 neighborhood N(i, j) for K steps as follows:

v_{ij}^{(0)} = r_{ij},    v_{ij}^{(k)} = max( v_{ij}^{(k−1)}, max_{(i',j') ∈ N(i,j)} p_{ij} v_{i'j'}^{(k−1)} ),

where p_{ij} is the propagation factor at pixel (i, j).
The action (i.e., the target position) is selected to be the pixel maximizing the value in a predefined neighborhood of the agent’s current position (x_t, y_t):

a_t = argmax_{(i,j) ∈ N(x_t, y_t)} v_{ij}^{(K)}.

Note that this fixed window restricts the planner to discrete, pixel-wise actions in the agent's immediate vicinity.
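A toy version of this value propagation in pure Python (a sketch of the update described above, not MVProp's batched convolutional implementation; names are illustrative):

```python
def propagate_values(reward_map, prop_map, steps):
    # reward_map[i][j] is 1 at the goal and 0 elsewhere; prop_map[i][j] in
    # [0, 1] is the flow probability (about 0 for walls). Values spread out
    # from the goal by repeated 3x3-neighborhood max operations, attenuated
    # by the propagation factor at the receiving pixel.
    h, w = len(reward_map), len(reward_map[0])
    v = [row[:] for row in reward_map]
    for _ in range(steps):
        nxt = [row[:] for row in v]
        for i in range(h):
            for j in range(w):
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < h and 0 <= nj < w:
                            nxt[i][j] = max(nxt[i][j], prop_map[i][j] * v[ni][nj])
        v = nxt
    return v
```

After enough steps, each free pixel's value decays geometrically with its distance from the goal, while wall pixels (propagation factor 0) stay at zero, so a greedy walk uphill on the value map follows a collision-free path.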
4 Hierarchical Decompositional Reinforcement Learning
We introduce a novel hierarchical architecture, HiDe, allowing for an explicit functional decomposition across layers. Similar to HAC (Levy et al., 2019), our method achieves temporal abstraction via nested policies. Moreover, our architecture enforces functional decomposition explicitly. This is achieved by nesting i) an abstract planning layer, followed by ii) a local planner that iii) guides a control component. Crucially, only the top layer receives global information and is responsible for planning a trajectory towards a goal. The lowest layer learns a control policy for agent locomotion. The middle layer refines the planning layer’s subgoals into subgoals for the control layer. Achieving functional decoupling across layers crucially depends on reducing the state in each layer to the information that is relevant to its specific task. This design significantly improves generalization (see Section 5).
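Schematically, the nested three-layer rollout can be sketched as below. The update frequencies `k_plan` and `k_mid` and all names are illustrative assumptions; in the actual architecture the layers are goal-conditioned policies, not plain functions:

```python
def rollout(env, planner, interface, control, final_goal, horizon,
            k_plan=40, k_mid=10):
    # Temporal abstraction via nesting: the planner emits a subgoal every
    # k_plan steps, the interface refines it every k_mid steps, and the
    # control policy acts at every step.
    state = env.reset()
    for t in range(horizon):
        if t % k_plan == 0:
            plan_goal = planner(state, final_goal)
        if t % k_mid == 0:
            sub_goal = interface(state, plan_goal)
        state, done = env.step(control(state, sub_goal))
        if done:
            break
    return state
```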
4.1 Planning Layer
The highest layer of a hierarchical architecture is expected to learn high-level actions over a longer horizon, which define a coarse trajectory in navigation-based tasks. In related work (Levy et al., 2019; Nachum et al., 2018, 2019), the planning layer, which learns an implicit value function, shares the same architecture as the lower layers. Since the task is learned for a specific environment, this design choice inherently limits generalization. In contrast, we introduce a planning-specific layer consisting of several components that learn a map of the environment and find a feasible path to the goal.
Value map. The planning layer learns a value map which projects the collected rewards onto the environment image. Given a top-down image of the environment, a convolutional network determines the per-pixel flow probability. For example, the probability value of a pixel corresponding to a wall should be 0 and that of a free passage close to 1.
Attention model. We augment an MVProp network with an attention model which learns to define the neighborhood dynamically and adaptively. Given the value map V and the agent’s current position (x_t, y_t), we estimate how far the agent can go, modeled by a 2D Gaussian. More specifically, we predict a full covariance matrix Σ with the agent’s global position as mean μ. We then build a 2D mask M of the same size as the environment image by using the Gaussian likelihood function:

M_{ij} = exp( −½ ([i, j] − μ)ᵀ Σ⁻¹ ([i, j] − μ) ).
Intuitively, the mask models the density of the agent’s success rate. Our planner policy selects an action (i.e., a subgoal) that maximizes the masked value map as follows:

g_t = argmax_{(i,j)} v̄_{ij},  with v̄_{ij} = M_{ij} v_{ij},

where v̄_{ij} corresponds to the value at pixel (i, j) on the masked value map V̄. Note that the subgoal selected by the planning layer is relative to the agent’s current position (x_t, y_t), resulting in better generalization performance.
The benefits of having an attention model are twofold. First, the planning layer considers the agent dynamics in assigning subgoals, which may lead to fine- or coarse-grained subgoals depending on the underlying agent’s performance. Second, the Gaussian window allows us to define a dynamic set of actions for the planner policy, which is essential to find a trajectory of subgoals on the map. While the action space includes all pixels of the value map, it is limited by the Gaussian mask to the subset of reachable pixels. We find that this leads to better obstacle avoidance behaviour, such as avoiding corners and walls, as shown in Figure 8 in the Appendix.
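A minimal sketch of the Gaussian attention mask and masked subgoal selection, assuming integer pixel coordinates and a hand-specified 2×2 covariance (in the paper the covariance is predicted by a network; names are illustrative):

```python
import math

def gaussian_mask(h, w, mu, cov):
    # Unnormalized 2D Gaussian likelihood over pixel coordinates, used as an
    # attention mask centered on the agent position `mu` = (row, col).
    # `cov` is a 2x2 covariance matrix [[a, b], [b, c]].
    (a, b), (_, c) = cov
    det = a * c - b * b
    inv = [[c / det, -b / det], [-b / det, a / det]]  # closed-form 2x2 inverse
    mask = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            dx, dy = i - mu[0], j - mu[1]
            m = dx * (inv[0][0] * dx + inv[0][1] * dy) \
                + dy * (inv[1][0] * dx + inv[1][1] * dy)
            mask[i][j] = math.exp(-0.5 * m)  # Mahalanobis distance -> weight
    return mask

def select_subgoal(value_map, mask):
    # Subgoal = pixel maximizing the masked value map.
    h, w = len(value_map), len(value_map[0])
    return max(((i, j) for i in range(h) for j in range(w)),
               key=lambda p: value_map[p[0]][p[1]] * mask[p[0]][p[1]])
```

Note how a nearby pixel with a moderate value can beat a distant pixel with a high value once the mask is applied, which is exactly the reachability trade-off described above.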
Since our planner layer operates in a discrete action space (i.e., pixels), the resolution of the projected maze image defines the minimum displacement for the agent, affecting maneuverability. This could be tackled by using a soft-argmax (Chapelle and Wu, 2010) to select the subgoal pixel, allowing real-valued actions and providing invariance to image resolution. In our experiments we see no difference in terms of final performance. However, since the discrete setting allows for the use of DQN (Mnih et al., 2013) instead of DDPG (Silver et al., 2014), we prefer the discrete action space for simplicity and faster convergence.
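For reference, a soft-argmax over a value map can be sketched as a softmax-weighted average of pixel coordinates (a generic sketch of the alternative discussed above, not the configuration used in the reported experiments):

```python
import math

def soft_argmax_2d(value_map, temperature=1.0):
    # Differentiable, real-valued alternative to the pixel argmax:
    # softmax over values, then expected (row, col) coordinate.
    # Lower temperature concentrates the result near the hard argmax.
    h, w = len(value_map), len(value_map[0])
    mx = max(max(row) for row in value_map)  # subtract max for stability
    weights = [[math.exp((value_map[i][j] - mx) / temperature)
                for j in range(w)] for i in range(h)]
    z = sum(map(sum, weights))
    ci = sum(i * weights[i][j] for i in range(h) for j in range(w)) / z
    cj = sum(j * weights[i][j] for i in range(h) for j in range(w)) / z
    return ci, cj
```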
4.2 Interface Layer
The middle layer in our hierarchy interfaces the high-level planning with low-level control by introducing an additional level of temporal abstraction. The planner’s longer-term goals are further split into a number of shorter-term targets. Such a refinement policy provides the lower-level control layer with reachable targets, which in turn yields easier rewards and hence accelerated learning.
The interface layer policy is the only layer that does not directly interact with the environment. More specifically, this policy receives only the subgoal from the upper layer and chooses an action (i.e., a subgoal) for the lower-level locomotion layer. Note that the goal, state and action spaces of this policy are all in 2D space. Contrary to Levy et al. (2019), we use subgoals that are relative to the agent’s position, which helps to learn symmetrical gaits and to generalize.
4.3 Control Layer
The lowest layer learns a goal-conditioned control policy. Due to our explicit functional decomposition, it is the only layer with access to the agent’s internal state including joint positions and velocities. Meanwhile, the higher layers only have access to the agent’s position. In the control tasks considered, an agent has to learn a policy to reach a certain goal position, e.g., reach a target position in a navigation domain. Similar to HAC, we use hindsight goal transition techniques so that the control policy receives rewards even in failure cases.
We evaluate our method in a series of continuous control tasks (see https://sites.google.com/view/hide-rl). All environments are simulated in the MuJoCo physics engine (Todorov et al., 2012). Experiment and implementation details are provided in Appendix B. First, in Section 5.1, we compare to various baseline methods. In Section 5.2, we move to an environment with a more complex design in order to show our model’s generalization capabilities. In Section 5.3, we provide an ablation study of our design choices. Finally, Section 5.4 demonstrates that our approach indeed leads to functional decomposition by transferring layers across agents and domains. We introduce the following task configurations (see Figure 5):
Maze Forward The training environment in all experiments. The task is to reach a goal from a fixed start position.
Maze Backward The training maze layout with swapped start and goal positions.
Maze Flipped Mirrored training environment with swapped starting and goal positions.
Maze Random Randomly generated mazes with random start and goal positions, used in the complex maze (5.2), ablation (5.3) and transfer (5.4) experiments.
We always train in the Maze Forward environment. The reward signal during training is constantly -1, unless the agent reaches the given goal (except for HIRO and HIRO-LR, see Section 5.1). We test the agents on the above tasks with fixed starting and fixed goal positions. For more details about the environments, we refer to Appendix A. We intend to answer the following three questions: 1) Can our method generalize to unseen test environment layouts? 2) Is it possible to transfer the planning layers between agents? 3) Does task decomposition lead to generalization across domains?
5.1 Simple Maze Navigation
We compare our method to state-of-the-art approaches including HIRO (Nachum et al., 2018), HIRO-LR (Nachum et al., 2019), HAC (Levy et al., 2019) and a modified version of HAC called RelHAC in a simple Maze Forward environment as shown in Figure 4 (a). To ensure a fair comparison, we made a number of improvements to the HAC and HIRO implementations. For HAC, we introduced target networks and used the hindsight experience replay technique with the future strategy (Andrychowicz et al., 2017). We observed that oscillations close to the goal position kept HIRO agents from finishing the task successfully. We solved this issue by doubling the distance threshold for success. HIRO-LR is the closest related work to our method, since it also receives a top-down view image of the environment. Note that both HIRO and HIRO-LR have access to dense negative distance rewards, which is an advantage over HAC and HiDe, which only receive a reward when finishing the task.
The modified HAC implementation (RelHAC) uses the same lower and middle layers as HiDe, but we do not decouple state-action spaces as is done in HiDe. Instead, RelHAC simply reuses the same structure for the middle and top layer. Preliminary experiments using fixed start and fixed goal positions during training for HAC, HIRO and HIRO-LR yielded zero success rates in all cases. Therefore, the baseline models are trained with fixed start and random goal positions, allowing them to receive a reward signal without having to reach the intended goal at the other end of the maze. In contrast, HiDe is trained with fixed start and fixed goal positions, whereas HiDe-R represents HiDe under the same conditions as the baseline methods. See Table 6 in the Appendix for an overview of all the baseline methods.
All models successfully learned this task as shown in Table 1 (Forward column). HIRO demonstrates slightly better performance, which can be attributed to the fact that it is trained with dense rewards. RelHAC performs worse than HAC due to the pruned state space of each layer and due to the lack of an effective planner.
Table 1 also summarizes the models’ generalization abilities on the unseen Maze Backward and Maze Flipped environments (see Figure 5). While HIRO, HIRO-LR and HAC manage to solve the training environment (Maze Forward) with success rates between 82% and 99%, they suffer from overfitting to the training environment, indicated by the 0% success rates in the unseen test environments. In contrast, our method achieves 54% and 69% success rates in this generalization task. As expected, training our model with random goal positions (i.e., HiDe-R) yields a more robust model outperforming vanilla HiDe.
In subsequent experiments, we only report the results for our method, since our experiments have shown that the baseline methods cannot solve the training task for the more complex environments which we present next. We hypothesize that the exploration capabilities of these methods are not sufficient to learn the task.
5.2 Complex Maze Navigation
In this experiment, we train an ant and a ball agent (for details please see Appendix A.1) in the Maze Forward task with a more complex environment layout (cf. Figure 5), i.e., we increase the size of the environment and the number of obstacles, thereby also increasing the distance to the final reward. We keep both the start and goal positions intact and evaluate this model in different tasks (see Section 5).
Table 2 reports success rates of both agents in this complex task. Our model successfully learns the training task, showing that it is able to scale to larger environments with longer horizons. The performance in the unseen testing environments Maze Backward and Maze Flipped is similar to the results shown in Section 5.1 despite the increased difficulty. Since the randomly generated mazes are typically easier, our model shows similar or better performance.
5.3 Ablation studies
To support the claim that our architectural design choices lead to better generalization and functional decomposition, we compare empirical results of different variants of our method with the ant agent. First, we compare the performance of relative and absolute positions for the experiment reported in Section 5.2. For this reason, we train HiDe-A and HiDe-AR, the corresponding variants of HiDe and HiDe-R that use absolute positions. Unlike the case of relative positions, the policy needs to learn all values within the range of the environment dimensions in this setting. Second, we compare HiDe against a variant of HiDe without the interface layer called HiDe-NI.
The results for the complex maze task configuration of the ablation experiments are summarized in Table 3. Both HiDe-A and HiDe-AR fail to solve the complex maze tasks. These results indicate that relative positions improve performance and are an important aspect of our method to scale to more complex environments and help generalization to other environment layouts. As seen in Table 3, the variant of HiDe without an interface layer (HiDe-NI) performs worse than HiDe (cf. Table 2) in all experiments. This indicates that the interface layer is an important part of our architecture.
Finally, we compare the learned attention window against fixed-size window variants; full results are reported in the Appendix. The learned attention window (HiDe) achieves better or comparable performance. In all cases, HiDe generalizes better in the Maze Backward variant of the complex maze. Moreover, the learned attention eliminates the need for tuning the window size hyperparameter per agent and environment.
5.4 Transfer of Policies
We argue that a key to better generalization behavior in hierarchical RL lies in enforcing a separation of concerns across the different layers. To test whether the overall task is really split into separate sub-tasks, we perform a sequence of experiments that transfer parts of the policy across agents and tasks.
5.4.1 Agent Transfer
We transfer individual layers across different agents to demonstrate that each part of the hierarchy indeed learns different sub-tasks. We first train an agent without our planning layer, i.e., with RelHAC. We then replace the top layer of this agent with the planning layer from the models trained in Section 5.2. Additionally, we train a humanoid agent and show as a proof of concept that transfer to a very complex agent can be achieved.
We carry out two sets of experiments. First, we transfer the ant model’s planning layer to the simpler 2DoF ball agent. As indicated in Table 4, the performance of the ball with the ant’s planning layer matches the results of the ball trained end-to-end with HiDe (cf. Table 2). The ball agent’s success rate increases for the Maze Random and Maze Forward tasks, whereas it decreases slightly in the Maze Backward and Maze Flipped task configurations.
Second, we attach the ball agent’s planning layer to the more complex ant agent. The newly composed agent performs marginally better or worse in the Maze Flipped, Maze Random and Maze Backward tasks. Please note that this experiment is an example of a case where the environment is first learned with a fast and easy-to-train agent (i.e., ball) and then utilized by a more complex agent. We hereby show that transfer of layers between agents is possible and therefore find our hypothesis to be valid. Moreover, an estimate indicates that the training is roughly 3 – 4 times faster, since the complex agent does not have to learn the planning layer.
To further demonstrate our method’s transfer capabilities, we train a humanoid agent (17 DoF) in an empty environment with shaped rewards. We then use the planning and interface layer from a ball agent and connect it as is with the locomotion layer of the trained humanoid (videos available at https://sites.google.com/view/hide-rl).
5.4.2 Domain Transfer
In this experiment, we demonstrate the capability of HiDe to transfer the planning layer from a simple ball agent, trained on a pure locomotion task, to a robot manipulation agent. The goal of this experiment is not to compete with state-of-the-art manipulation algorithms, but rather to highlight both our contributions, i.e., i) transfer of the modular layers across agents and domains and ii) generalization to different environment layouts without retraining.
To this end, we train a ball agent with HiDe as described in Section 5.1. Moreover, we train a control policy for a robot manipulation task in the OpenAI Gym "Push" environment (Brockman et al., 2016), which learns to push a cube-shaped object to a goal position relative to the robot. Note that the manipulation task does not feature any obstacles during training. To obtain the final compound agent, we then attach the planning and interface layers of the ball agent to the goal-conditioned manipulation policy (cf. Figure 1). For testing, we generate 500 random environment layouts similar to the Random task described in Section 5. Similar to the navigation experiments in Section 5.1, state-of-the-art methods are able to solve these tasks when trained on a single environment layout. However, they do not generalize to other layouts without retraining. In contrast, our evaluation of the compound HiDe agent on unseen testing layouts shows a success rate of 49%. Note that the control layer has never been exposed to obstacles before. Thus, our modular approach can achieve domain transfer and generalize to different environments without retraining.
In this paper, we introduce a novel HRL architecture that can solve complex control tasks in 3D-based environments. The architecture consists of a planning layer which learns an explicit value map and is connected with a subgoal refinement layer and a low-level control layer. The framework can be trained end-to-end. While trained with fixed start and goal positions, our method is able to generalize to previously unseen settings and environments. Furthermore, we demonstrate that transfer of planners between different agents can be achieved, enabling us to move a planner trained with a simplistic agent to a more complex agent, such as a humanoid or a robot manipulator. The key insight lies in the strict separation of concerns across layers, which is achieved via decoupled state-action spaces and restricted access to global information. In the future, we want to extend our method to a 3D-based planning layer connected with a 3D attention mechanism.
- State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, pp. 119–125.
- Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058.
- The option-critic architecture. arXiv preprint arXiv:1609.05140.
- Dynamic programming and optimal control. 2nd edition, Athena Scientific.
- OpenAI Gym.
- Gradient descent optimization of smoothed information retrieval metrics. Information Retrieval 13(3), pp. 216–235.
- Feudal reinforcement learning. In Advances in Neural Information Processing Systems 5.
- Hierarchical reinforcement learning with the MAXQ value function decomposition. arXiv preprint cs.LG/9905014.
- Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338.
- Search on the replay buffer: bridging planning and reinforcement learning. arXiv preprint arXiv:1906.05253.
- Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012.
- Meta learning shared hierarchies. arXiv preprint arXiv:1710.09767.
- Cognitive mapping and planning for visual navigation. pp. 2616–2625.
- End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702.
- Learning multi-level hierarchies with hindsight. In International Conference on Learning Representations.
- Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 361–368.
- Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- Near-optimal representation learning for hierarchical reinforcement learning. In International Conference on Learning Representations.
- Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3303–3313.
- Value propagation networks. In International Conference on Learning Representations.
- Planning with goal-conditioned policies. In Advances in Neural Information Processing Systems, pp. 14814–14825.
- Value prediction network. arXiv preprint arXiv:1707.03497.
- Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems 10, pp. 1043–1049.
- Automatic differentiation in PyTorch.
- Learning to generate subgoals for action sequences. In IJCNN-91-Seattle International Joint Conference on Neural Networks, vol. 2, pp. 453–.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395.
- Mastering the game of Go without human knowledge. Nature 550, pp. 354–359.
- Universal planning networks. arXiv preprint arXiv:1804.00645.
- Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2), pp. 181–211.
- Dyna-style planning with linear function approximation and prioritized sweeping. arXiv preprint arXiv:1206.3285.
- Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ML.
- Value Iteration Networks. In Advances in Neural Information Processing Systems, pp. 2154–2162. Cited by: §2.2, §3.3.
- DeepMind Control Suite. Technical report Vol. abs/1504.04804. Note: https://arxiv.org/abs/1801.00690 Cited by: §A.1.
- Exploiting hierarchy for learning and transfer in kl-regularized RL. CoRR abs/1903.07438. External Links: Cited by: §2.1.
- MuJoCo: a physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: Appendix A, §5.
- FeUdal networks for hierarchical reinforcement learning. abs/1703.01161. External Links: Cited by: §1, §2.1.
- Benchmarking model-based reinforcement learning. abs/1907.02057. External Links: Cited by: §2.2.
Appendix A Environment details
We build on the MuJoCo (Todorov et al., 2012) environments used in (Nachum et al., 2018). Each episode is terminated after 500 steps in experiment 1 and after 800 steps in the remaining experiments, or earlier once the goal is reached. All rewards are sparse as in (Levy et al., 2019), i.e., 0 for reaching the goal and -1 otherwise. We consider the goal reached when the agent is within a fixed distance threshold of the goal position. Since HIRO sets goals far ahead to encourage the lower layer to move toward the goal faster, the agent oscillates around the target position; for HIRO we therefore use a larger threshold.
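The sparse reward and the distance-based success check can be sketched as follows; the threshold `eps` is a placeholder for the per-environment value (larger for HIRO, as noted above), and the function name is hypothetical:

```python
import numpy as np

def sparse_reward(pos, goal, eps):
    """Sparse reward: 0 when within eps of the goal position, -1 otherwise.

    `eps` stands in for the environment-specific distance threshold.
    """
    reached = np.linalg.norm(np.asarray(pos) - np.asarray(goal)) <= eps
    return 0.0 if reached else -1.0
```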
Our ant agent is equivalent to the one in (Levy et al., 2019), i.e., the ant from Rllab (Duan et al., 2016) with a gear power of 16 instead of 150 and a frame skip of 10 instead of 5. Our ball agent is the PointMass agent from the DM Control Suite (Tassa et al., 2018). We changed the joints so that the ball rolls instead of sliding, and we resized the motor gear and the ball itself to match the maze size. For the manipulation robot, we slightly adapt the “Push” task from OpenAI Gym (Brockman et al., 2016). The original environment uses an inverse kinematics controller to steer the robot, which enforces joint positions directly and ignores realistic physics. This can cause unwanted behavior, such as penetration through objects. Hence, we change the control inputs to motor torques for the joints.
A.2.1 Locomotion mazes
All locomotion mazes are modelled by immovable blocks; (Nachum et al., 2018) use a different block size. The environment shapes are depicted in Figure 5. For the randomly generated mazes, each block is sampled as empty with a fixed probability, and the start and goal positions are sampled uniformly at random. Mazes where the start and goal positions are adjacent, or where the goal is not reachable, are discarded. For evaluation, we generated 500 such environments and reused them (one per episode) in all experiments.
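The rejection-sampling procedure above can be sketched as follows; the grid size, empty-cell probability, and all function names are illustrative placeholders, and reachability is checked with a breadth-first search:

```python
import random
from collections import deque

def sample_maze(n, p_empty, rng):
    """Sample an n x n grid; each cell is empty (True) with probability p_empty."""
    return [[rng.random() < p_empty for _ in range(n)] for _ in range(n)]

def reachable(grid, start, goal):
    """Breadth-first search over 4-connected empty cells."""
    n = len(grid)
    seen, queue = {start}, deque([start])
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            return True
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < n and 0 <= ny < n and grid[nx][ny] and (nx, ny) not in seen:
                seen.add((nx, ny))
                queue.append((nx, ny))
    return False

def sample_valid_maze(n=8, p_empty=0.7, seed=0):
    """Reject mazes until start and goal are non-adjacent and connected."""
    rng = random.Random(seed)
    while True:
        grid = sample_maze(n, p_empty, rng)
        empty = [(i, j) for i in range(n) for j in range(n) if grid[i][j]]
        if len(empty) < 2:
            continue
        start, goal = rng.sample(empty, 2)
        adjacent = abs(start[0] - goal[0]) + abs(start[1] - goal[1]) <= 1
        if not adjacent and reachable(grid, start, goal):
            return grid, start, goal
```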
A.2.2 Manipulation environments
The manipulation environments differ from the locomotion mazes in scale, with walls arranged in a grid of blocks. The object position is the position used for the interface layer. When the object left the top-down view range, the episode was terminated, and the last observation was added only to the control layer with a subgoal penalty. The random layouts were generated with the same methodology as the locomotion mazes.
Appendix B Implementation Details
B.1 Baseline experiments
For both HIRO and HAC we used the authors’ original implementations (HIRO: https://github.com/tensorflow/models/tree/master/research/efficient-hrl; HAC: https://github.com/andrew-j-levy/Hierarchical-Actor-Critc-HAC-). We ran the hiro_xy variant, which uses only position coordinates for subgoals instead of all joint positions, for a fair comparison with our method. To improve the performance of HAC in experiment one, we modified its Hindsight Experience Replay (Andrychowicz et al., 2017) implementation to use the “future” strategy. More importantly, we also added target networks to both the actor and the critic. We used OpenAI Baselines for the DDPG+HER implementation. When pretraining for domain transfer, we made the achieved goals relative before feeding them into the network. For an overview, see Table 6.
B.2 Evaluation details
For evaluation, we trained 5 seeds, each for 2.5M steps, on the “Forward” environment with continuous evaluation (every 100 episodes, for 100 episodes). After training, we selected the best checkpoint of each seed based on the continuous evaluation. We then tested the learned policies for 500 episodes and report the average success rate. Although the agent and goal positions are fixed, the initial joint positions and velocities are sampled from a uniform distribution, as is standard in OpenAI Gym environments (Brockman et al., 2016). Therefore, the tables in the results (cf. Section 5) contain means and standard deviations across the 5 seeds.
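The per-seed aggregation can be sketched as follows; the helper name is hypothetical, and the inputs would be the five per-seed success rates measured over 500 test episodes each:

```python
import statistics

def aggregate_seeds(success_rates):
    """Mean and sample standard deviation across per-seed success rates."""
    mean = statistics.mean(success_rates)
    std = statistics.stdev(success_rates) if len(success_rates) > 1 else 0.0
    return mean, std
```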
B.3 Network Structure
B.3.1 Planning layer
Input images for the planning layer were binarized as follows: each pixel corresponds to one block (0 for a wall, 1 for a corridor). In our planning layer, we process the input image of size 32x32 (20x20 for experiment 1) with two convolutional layers. Both layers have a single input and output channel and are padded so that the output size equals the input size. We propagate the value through the value map as in (Nardelli et al., 2019) using repeated max-pooling steps. Finally, the value map and an agent-position image (a black image with a dot at the agent position) are processed by three convolutions interleaved with max-pooling, using zero padding. The result is flattened and passed through two fully connected layers, which together produce three outputs with different activation functions. From these outputs, the final covariance matrix is constructed
so that the matrix is always symmetric and positive definite. For numerical reasons, we multiply by a binarized kernel mask instead of the actual Gaussian densities: values greater than or equal to the mean are set to one, the others to zero. In practice, we use this line:
import torch as t  # PyTorch, imported as `t`

kernel = t.where(kernel >= kernel.mean(dim=[1, 2], keepdim=True),
                 t.ones_like(kernel), t.zeros_like(kernel))
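The value-propagation step described above can be sketched in plain NumPy: starting from a map that is 1 at the goal, value is repeatedly spread to neighbouring free cells via a discounted 3x3 maximum. The discount and iteration count are hypothetical placeholders, and this is a simplification of the module in (Nardelli et al., 2019):

```python
import numpy as np

def propagate_value(free, goal, iters=20, gamma=0.9):
    """Propagate value over a grid (free: 1 = corridor, 0 = wall).

    Each iteration takes the discounted maximum over every cell's 3x3
    neighbourhood -- a stride-1 max-pooling analogue of the value
    propagation -- and masks out wall cells.
    """
    v = np.zeros(free.shape, dtype=float)
    v[goal] = 1.0
    h, w = v.shape
    for _ in range(iters):
        p = np.pad(v, 1, mode="constant")
        # Maximum over the 9 shifted copies = 3x3 neighbourhood max.
        neigh = np.max(
            [p[i:i + h, j:j + w] for i in range(3) for j in range(3)], axis=0
        )
        v = np.maximum(v, gamma * neigh) * free
        v[goal] = 1.0  # keep the goal pinned to full value
    return v
```

The resulting map decays geometrically with (obstacle-aware) distance to the goal, so following the local gradient leads toward it.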
B.3.2 Middle and locomotion layer
We use the same network architecture for the middle and lower layers as proposed by (Levy et al., 2019), i.e., three fully connected layers. The output of the locomotion layer is passed through a bounded activation and then scaled to the action range.
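The scaling of the locomotion layer's output to the action range can be sketched as follows; the choice of tanh as the bounded activation and the helper name are assumptions for illustration:

```python
import numpy as np

def scale_to_action_range(x, low, high):
    """Map an unbounded network output to [low, high].

    tanh squashes to (-1, 1) (assumed activation); the affine map then
    rescales the squashed value to the action range.
    """
    squashed = np.tanh(x)
    return low + (squashed + 1.0) * 0.5 * (high - low)
```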
B.3.3 Training parameters
- The same discount factor is used for all agents.
- Adam optimizer; the same learning rate is used for all actors and critics.
- Soft target updates using a moving average for all controllers.
- The replay buffer size was designed to store 500 episodes, similarly to (Levy et al., 2019).
- Updates are performed after each epoch on each layer, once the replay buffer contains at least 256 transitions.
- Batch size 1024.
- Rewards 0 and -1, without any normalization.
- Subgoal testing (Levy et al., 2019) only for the middle layer.
- Observations were not normalized either.
- 2 HER transitions per transition, using the future strategy (Andrychowicz et al., 2017).
- Exploration noise: 0.05, 0.01, and 0.1 for the planning, middle, and locomotion layers, respectively.
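The hindsight relabelling from the list above (two extra transitions per original transition, "future" strategy) can be sketched as follows; the transition layout and helper names are hypothetical, while the 0/-1 reward convention matches the description above:

```python
import random

def her_future_relabel(episode, k=2, rng=None):
    """Relabel transitions with goals achieved later in the same episode.

    `episode` is a list of dicts with keys 'obs', 'action', 'next_obs',
    'achieved_goal'. For each transition, k achieved goals are drawn from
    the transition's own index onward ('future' strategy) and the sparse
    reward is recomputed: 0 if the relabelled goal equals this transition's
    achieved goal, -1 otherwise.
    """
    rng = rng or random.Random(0)
    relabelled = []
    for t, tr in enumerate(episode):
        for _ in range(k):
            future = rng.randrange(t, len(episode))  # index from t onward
            goal = episode[future]["achieved_goal"]
            reward = 0.0 if tr["achieved_goal"] == goal else -1.0
            relabelled.append({**tr, "goal": goal, "reward": reward})
    return relabelled
```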
B.4 Computational infrastructure
All HiDe, HAC and HIRO experiments were trained on a single GPU (GTX 1080). The OpenAI DDPG+HER baselines were trained on 19 CPUs using the OpenAI Baselines repository.
Appendix C Additional results
In this section, we present all results collected for this paper including individual runs.
[Table: training-setup comparison across methods. Recovered rows — Random start pos: x, x, x, x, ✓, ✓; Random end pos: x, ✓, ✓, ✓, ✓, ✓. Column headers not recovered.]
Images: whether the state space includes images of the environment.
Random start pos: whether the starting position is randomized during training.
Random end pos: whether the goal position is randomized during training.
Agent position: whether the state space includes the agent’s position.
Shaped reward: whether the algorithm learns from a shaped reward.
EnvGeneral: whether generalization to new environments is possible without retraining.
Agent transfer: whether layers can be transferred between agents without retraining.
[Per-seed result tables: Ant 1–5 and Ball 1–5; AB 1–5 (averaged); BA 1–5 (averaged); five further Ant 1–5 tables (averaged); Arm 1–5 (averaged). Numeric entries not recovered.]