Curious Hierarchical Actor-Critic Reinforcement Learning

by Frank Röder, et al.
University of Hamburg

Hierarchical abstraction and curiosity-driven exploration are two common paradigms in current reinforcement learning approaches to break down difficult problems into a sequence of simpler ones and to overcome reward sparsity. However, there is a lack of approaches that combine these paradigms, and it is currently unknown whether curiosity also helps to perform the hierarchical abstraction. As a novelty and scientific contribution, we tackle this issue and develop a method that combines hierarchical reinforcement learning with curiosity. Herein, we extend a contemporary hierarchical actor-critic approach with a forward model to develop a hierarchical notion of curiosity. We demonstrate in several continuous-space environments that curiosity approximately doubles the learning performance and success rates for most of the investigated benchmarking problems.




1 Introduction

Figure 1: The CHAC architecture with two layers of hierarchy. In each layer, a forward model is employed to compute the prediction error, which provides an additional curiosity-based reward for that layer. This intrinsic reward is added to the extrinsic reward to train the actor-critic.

A general problem for reinforcement learning is sparse rewards. For example, tasks as simple as drinking water involve a complex sequence of motor commands, and only upon completion of this complex sequence, a reward is provided, which destabilizes the learning of value functions. Hierarchical Reinforcement Learning (HRL) partially alleviates this issue by decomposing difficult tasks into simpler subtasks, providing additional intrinsic rewards upon completion of the subtasks. Therefore, HRL is a major step towards human-like cognition [21] and decision-making [3]. There exists a considerable body of research demonstrating that hierarchical architectures provide a significant performance gain compared to non-hierarchical architectures by performing such abstractions [6, 16, 26].

However, HRL does not completely eliminate the problem of reward sparsity. By adding intrinsic rewards for achieving subtasks, it rather transforms the problem of reward sparsity into the problem of selecting the appropriate subgoals or subtasks. Learning the subgoal or subtask-selection still suffers from reward sparsity. So how can we improve the learning of subtask selection under sparse rewards?

Current reinforcement learning literature offers two commonly used methods for overcoming reward sparsity that we investigate to address this question. The first method is hindsight experience replay (HER) [1]. The idea behind HER is to pretend in hindsight that the final state of a rollout was the goal of the rollout, regardless of whether it was actually the original goal. This way, unsuccessful rollouts get rewarded by considering in hindsight that they were successful. In recent work, Levy et al. [16] have successfully combined HER with a hierarchical actor-critic reinforcement learning approach, demonstrating a significant performance gain for several continuous-space environments. The second method to densify rewards is curiosity. Existing curiosity-based approaches in non-hierarchical reinforcement learning (e.g. [10, 20]) provide additional rewards when the agent is surprised. Following the work of Friston et al. [8], the notion of surprise is based on the prediction error of an agent’s internal forward model. That is, the agent is surprised when its internal prediction of the world dynamics does not coincide with the actual dynamics.

There exists a considerable number of recent approaches to hierarchical reinforcement learning (e.g. [2, 12, 13, 15, 16, 19, 26]). We are also aware of significant recent improvements in curiosity-driven non-hierarchical reinforcement learning (e.g. [5, 7, 10, 11, 20, 27]). However, despite significant evidence from the cognitive sciences suggesting that curiosity is a hierarchical phenomenon [21], there exist no functional computational models to verify this hypothesis.

In this paper, we address this gap and ask the following central research question: To what extent can we alleviate reward sparsity and improve the learning performance of hierarchical actor-critic reinforcement learning with a hierarchical curiosity mechanism?

We address this question by extending the hierarchical actor-critic approach by Levy et al. [16] with a reward signal that fosters the agent’s curiosity. We extend the approach with Friston et al.’s proposal to model surprise based on prediction errors [8] and provide the agent with intrinsic rewards if it is surprised (see Figure 1). As a novelty and scientific contribution, we are the first to present a computational model that combines curiosity with hierarchical reinforcement learning and that also considers hindsight experience replay as an additional method to overcome reward sparsity. We refer to our method as Curious Hierarchical Actor-Critic (CHAC) and evaluate our approach in several continuous-space benchmark environments.

2 Background and Related Work

Our research integrates hierarchical reinforcement learning with a curiosity and surprise mechanism inspired by the principle of active inference [8]. In the following, we provide the background of these mechanisms and methods.

2.1 Reinforcement Learning

Reinforcement learning (RL) involves a Markov Decision Process (MDP) to maximize the long-term expected reward. An MDP is defined as a tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $R$ is a reward function, $P(s' \mid s, a)$ is the transition probability of reaching state $s'$ from the current state $s$ when executing action $a$, and $\gamma$ is a discount factor, indicating how much the agent prefers short-term to long-term rewards. In our setting, the agent takes actions drawn from a probability distribution over actions, a policy, denoted $\pi$. The goal of the agent is to take actions that maximize the long-term expected reward. In this work, we employ the Deep Deterministic Policy Gradient (DDPG) algorithm [17] for policy learning. DDPG is a model-free off-policy actor-critic algorithm that combines the Deterministic Policy Gradient (DPG) algorithm [25] with the Deep Q-network (DQN) [18]. This enables a DDPG agent to work in continuous spaces while learning with large, non-linear function approximators more stably and efficiently. In Section 3 we define how this non-hierarchical notion of reinforcement learning is extended to the hierarchical actor-critic case.
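As a small worked illustration of how the discount factor trades off short-term against long-term rewards, the following sketch computes a discounted return for a sparse-reward rollout. The function name and the reward values (a reward of -1 for every step that misses the goal and 0 once the goal is reached) are assumptions chosen for illustration only.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical sparse-reward rollout: three unsuccessful steps, then the goal is reached.
rollout_rewards = [-1.0, -1.0, -1.0, 0.0]

for gamma in (0.0, 0.9, 1.0):
    print(gamma, discounted_return(rollout_rewards, gamma))
```

With a discount factor of 0 only the immediate reward counts, whereas a discount factor of 1 weights all steps of the rollout equally.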

2.2 Curiosity-Driven Exploration

Friston et al. [8] describe surprise as “the improbability of sampling some signals, under a generative model of how those signals were caused.” Hence, curiosity can be achieved by maximizing surprise, i.e., by maximizing the probability of sampling signals that do not coincide with the predictions by the generative model [4, 8].¹

¹ Note that curiosity is a broad term and there exist other rich notions of curiosity [9]. However, for this paper we focus on the well-defined and established notion of curiosity as maximizing prediction errors.

A common method realizing this in practical reinforcement learning applications is to define a generative forward model that maps states and actions to successive states. One can then use the forward model to implement surprise as a normalized function of the error between the successive states predicted by the model and the actual successive states. This strategy has been successfully employed in several non-hierarchical reinforcement learning approaches [4, 7, 10, 11, 20, 23, 24, 27].

For example, Pathak et al. [20] propose an Intrinsic Curiosity Module, introducing an additional internal reward that is defined as the squared error of the predictions generated by a forward model. Similarly, Hafez et al. [10] implement surprise as the absolute error of a set of forward models, and Watters et al. [27] use the squared error as a reward signal.
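The two error measures mentioned above can be sketched as follows. This is only an illustration of squared versus absolute prediction error on plain lists of floats; the function names are hypothetical, and scaling factors or feature encodings used in the cited works are deliberately left out.

```python
def squared_error_surprise(predicted_state, actual_state):
    """Sum of squared differences between predicted and observed successor state."""
    return sum((p - a) ** 2 for p, a in zip(predicted_state, actual_state))


def absolute_error_surprise(predicted_state, actual_state):
    """Sum of absolute differences between predicted and observed successor state."""
    return sum(abs(p - a) for p, a in zip(predicted_state, actual_state))


predicted = [1.0, 2.0]
observed = [1.0, 4.0]
print(squared_error_surprise(predicted, observed))
print(absolute_error_surprise(predicted, observed))
```

The squared error emphasizes large deviations more strongly than the absolute error, so the choice affects how sharply rare, badly predicted transitions are rewarded.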

3 Curious Hierarchical Actor-Critic

The hierarchical actor-critic (HAC) approach by Levy et al. [16] has shown great potential in continuous-space experimentation environments. At the same time, there exists extensive research [10, 20] showing how curious agents that strive to maximize their surprise improve their learning performance. In the following, we describe how we combine both paradigms.

3.1 Hierarchical Actor-Critic

Hierarchical actor-critic (HAC) [16] is a framework that enables agents to learn a nested hierarchy of policies. It uses hindsight experience replay (HER) [1] to alleviate reward sparsity. Each layer of the hierarchy learns to solve a subproblem defined by the spaces and the transition function of the layers below it. The highest layer receives as input the current state and the overall extrinsic goal. High-level layers produce actions that are subgoals for the next lower level. The lowest layer produces motor commands that are executable by the agent in the environment.

Formally, we define a hierarchy of $k$ layers, each containing an actor-critic network and a replay buffer to store experiences. Here we further expand the RL setting (cf. Section 2.1) for hierarchical agents. A layer $i$ is described as a Universal Markov Decision Process (UMDP) with an additional set of goals $\mathcal{G}_i$ [16]. The UMDP, an extension of the MDP with a universal value function approximator (UVFA) [22], is a tuple $\mathcal{U}_i = (\mathcal{S}_i, \mathcal{G}_i, \mathcal{A}_i, T_i, R_i, \gamma_i)$ containing the state space $\mathcal{S}_i$, the goal space $\mathcal{G}_i$, the action space $\mathcal{A}_i$, the transition probability function $T_i$, the reward function $R_i$, and the discount rate $\gamma_i$ for each layer $i$. The state space of each layer is identical to the original, namely $\mathcal{S}_i = \mathcal{S}$. The subgoals produced by the policy of each layer are within $\mathcal{S}$, and therefore $\mathcal{G}_i = \mathcal{S}$. The action space is equal to the goal space of the next lower layer, except for the lowest one, thus $\mathcal{A}_i = \mathcal{G}_{i-1}$ for $i > 0$. Only in the lowest layer, we execute the so-called primitive actions of the agent within the environment and therefore have $\mathcal{A}_0 = \mathcal{A}$ [16].
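To make the relation between the layers' spaces concrete, the following sketch derives the input and output sizes of each layer's policy: every policy sees the state together with its goal, the lowest layer outputs a primitive action, and every higher layer outputs a subgoal. The names `LayerSpec` and `build_layer_specs`, the zero-based layer index, and the example dimensions are assumptions for illustration, not part of the described method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    index: int              # 0 is assumed to be the lowest layer
    policy_input_dim: int   # state concatenated with the goal of this layer
    policy_output_dim: int  # primitive action (layer 0) or subgoal (higher layers)


def build_layer_specs(num_layers, state_dim, goal_dim, primitive_action_dim):
    """Derive per-layer policy dimensions for a hierarchy of goal-conditioned policies."""
    specs = []
    for index in range(num_layers):
        # Only the lowest layer acts in the environment; all others emit subgoals.
        output_dim = primitive_action_dim if index == 0 else goal_dim
        specs.append(LayerSpec(index, state_dim + goal_dim, output_dim))
    return specs


# Hypothetical two-layer agent; goals are assumed to live in the state space.
for spec in build_layer_specs(num_layers=2, state_dim=10, goal_dim=10, primitive_action_dim=4):
    print(spec)
```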

HAC involves the following three kinds of state transitions that implement hindsight experience replay (HER) [1] in a hierarchical setting.

3.1.1 Hindsight Goal Transitions

These are the same transitions as in the non-hierarchical HER method: after a rollout has completed, the agent pretends in hindsight that the actually achieved state was the goal state. Computationally, this is implemented by adding state transition samples with modified goals to the replay buffer. It enables the critic function to encounter at least one sparse reward after a sequence of actions. Hindsight goal transitions generalize the approximation to other regions of the state-action space.
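The relabeling step can be sketched as follows. This is a minimal illustration that assumes transitions are stored as dictionaries, that a reward of 0 marks a reached goal and -1 a missed one, and that goal achievement is checked with a small tolerance; all names are hypothetical.

```python
def goal_reached(state, goal, tolerance=1e-6):
    """True if every state component is within `tolerance` of the goal."""
    return all(abs(s - g) <= tolerance for s, g in zip(state, goal))


def hindsight_goal_transitions(episode):
    """Copy an episode, pretending the finally achieved state was the goal.

    `episode` is a list of dicts with the keys
    'state', 'action', 'next_state' and 'goal'.
    """
    achieved = episode[-1]["next_state"]
    relabeled = []
    for transition in episode:
        # Assumed sparse reward: 0 when the (new) goal is reached, -1 otherwise.
        reward = 0.0 if goal_reached(transition["next_state"], achieved) else -1.0
        relabeled.append({**transition, "goal": achieved, "reward": reward})
    return relabeled


# Hypothetical 1-D rollout that never reaches its original goal at 10.0.
episode = [
    {"state": [0.0], "action": [1.0], "next_state": [1.0], "goal": [10.0]},
    {"state": [1.0], "action": [1.0], "next_state": [2.0], "goal": [10.0]},
    {"state": [2.0], "action": [1.0], "next_state": [3.0], "goal": [10.0]},
]
for transition in hindsight_goal_transitions(episode):
    print(transition["goal"], transition["reward"])
```

The relabeled copies are added to the replay buffer alongside the original transitions, so the last step of every rollout carries a non-penalizing reward.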

3.1.2 Hindsight Action Transitions

When a high-level layer sends a subgoal to a low-level layer, it frequently happens that the low-level layer fails to achieve the subgoal. Once the low-level layer is trained better, it achieves subgoals more often. This dynamic process slows down the learning of the high-level layer because it constantly needs to adapt to the changing dynamics of the low-level layer. To alleviate this issue, HAC adds additional state transitions to the replay buffer that simulate an optimal lower-level policy and, therefore, a stable low-level transition function. These additional state transitions are generated by pretending in hindsight that the subgoal provided as action to the low-level layer has been achieved. Technically, this is implemented by replacing the successor state of the state transition with the goal. With this procedure, HAC can learn multiple levels of policies in parallel, even if the lower-level policies are not yet fully trained.
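Following the description above, a hindsight action transition for a high-level layer can be sketched as below. The dictionary layout, the 0/-1 reward convention, and the tolerance check are assumptions made for this illustration; the function names are hypothetical.

```python
def goal_reached(state, goal, tolerance=1e-6):
    """True if every state component is within `tolerance` of the goal."""
    return all(abs(s - g) <= tolerance for s, g in zip(state, goal))


def hindsight_action_transition(state, subgoal, layer_goal, tolerance=1e-6):
    """Build the transition a high-level layer would see with an optimal lower layer.

    The state that the lower layer actually reached is ignored: the successor
    state is set to the proposed subgoal, as if it had been achieved.
    """
    next_state = list(subgoal)
    # Assumed sparse reward with respect to the high-level layer's own goal.
    reward = 0.0 if goal_reached(next_state, layer_goal, tolerance) else -1.0
    return {
        "state": list(state),
        "action": list(subgoal),
        "next_state": next_state,
        "goal": list(layer_goal),
        "reward": reward,
    }


print(hindsight_action_transition(state=[0.0], subgoal=[1.0], layer_goal=[5.0]))
print(hindsight_action_transition(state=[4.0], subgoal=[5.0], layer_goal=[5.0]))
```

The same consistency between action and successor state can also be obtained the other way round, by keeping the reached state and replacing the subgoal action with it; which of the two an implementation uses is a design choice that this sketch does not settle.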

3.1.3 Subgoal Testing Transitions

Some subgoals may just be impossible or too hard to achieve for a low-level policy. To foster the generation of subgoals that are actually achievable by the low-level layer, HAC frequently tests whether subgoals can be achieved. During these testing phases, exploration is disabled, and a penalty is given when a subgoal could not be reached. Since difficult subgoals are penalized in the beginning of the training, but not anymore when the agent is trained better, subgoal testing provides HAC with a method to automatically generate a curriculum.
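A subgoal test could be sketched as follows. The testing probability and the penalty value are hypothetical parameters of this illustration, as are the function names and the dictionary layout of a transition.

```python
import random


def goal_reached(state, goal, tolerance=1e-6):
    """True if every state component is within `tolerance` of the goal."""
    return all(abs(s - g) <= tolerance for s, g in zip(state, goal))


def is_testing_phase(rng, test_probability):
    """Decide at random whether the next subgoal is tested (exploration disabled)."""
    return rng.random() < test_probability


def subgoal_testing_transition(state, subgoal, reached_state, layer_goal, penalty):
    """Return a penalized transition if the tested subgoal was missed, else None."""
    if goal_reached(reached_state, subgoal):
        return None  # the subgoal was achievable, so nothing is penalized
    return {
        "state": list(state),
        "action": list(subgoal),
        "next_state": list(reached_state),
        "goal": list(layer_goal),
        "reward": penalty,
    }


rng = random.Random(0)
if is_testing_phase(rng, test_probability=1.0):
    # The lower layer only got to 0.4 although 1.0 was requested.
    print(subgoal_testing_transition([0.0], [1.0], [0.4], [5.0], penalty=-10.0))
```

As the lower layer improves, fewer tested subgoals are missed and fewer penalties are generated, which is what produces the curriculum effect described above.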

3.2 Combining Hierarchical Actor-Critic with Curiosity

To combine HAC with curiosity-based rewards, we implement a forward model based on a multi-layered perceptron that learns to predict the successive state $\hat{s}_{t+1}$ given the current state $s_t$ and an action $a_t^i$ at time $t$. Formally, this mapping is given as follows:

$$f_{fw}^i(s_t, a_t^i; \theta_{fw}^i) = \hat{s}_{t+1} \tag{1}$$

with the model’s parameters $\theta_{fw}^i$. An action $a_t^i$ produced by a policy $\pi^i$ of the layer $i$ (except the bottom layer, where $i = 0$) at time $t$ is a subgoal for the subsequent level. We implement one forward model per layer. That is, we define a forward model not only for the primitive action in the lowest layer but also for the subgoal action in the higher layers. The learning objective for training the forward model is to minimize the prediction loss, defined as:

$$\mathcal{L}_{fw}^i = \frac{1}{n} \left\lVert f_{fw}^i(s_t, a_t^i; \theta_{fw}^i) - s_{t+1} \right\rVert_2^2 \tag{2}$$

where $n$ is the dimensionality of the state. Similar to the approach by Pathak et al. [20], the forward model’s error of the layer $i$ is used to realize the curiosity-based bonus, denoted as $r_{c,t}^i$. We calculate the mean-squared error as follows:

$$r_{c,t}^i = \frac{1}{n} \left\lVert \hat{s}_{t+1} - s_{t+1} \right\rVert_2^2 \tag{3}$$

The regular extrinsic rewards (from the environment) are defined in the range of $[-1, 0]$; hence, we need to normalize the curiosity reward resulting from Equation 3. The normalization of the curiosity reward is conducted with respect to the maximum and minimum values of the curiosity level in the whole history (stored in a buffer), $r_{c,\max}^i$ and $r_{c,\min}^i$ respectively, as follows:

$$\bar{r}_{c,t}^i = \frac{r_{c,t}^i - r_{c,\min}^i}{r_{c,\max}^i - r_{c,\min}^i} - 1 \tag{4}$$

In other words, if the prediction error is high, corresponding to high curiosity, the normalized value will be close to $0$; otherwise, it is close to $-1$.

The total reward $r_t^i$ at time $t$ that layer $i$ receives, given the extrinsic reward $r_{e,t}^i$ and the normalized curiosity reward $\bar{r}_{c,t}^i$, is controlled by the hyper-parameter $\eta$ as follows:

$$r_t^i = \eta \cdot r_{e,t}^i + (1 - \eta) \cdot \bar{r}_{c,t}^i \tag{5}$$

This hyper-parameter is crucial for balancing the two reward terms: if $\eta = 1$, the curiosity reward has no effect and the method is identical to HAC. We further elaborate on the different values of $\eta$ in Section 4.
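The normalization and mixing of rewards described in this section can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes that extrinsic rewards lie in the range from -1 to 0, that the normalized curiosity reward is shifted into the same range, and that a single weight (called `eta` here) interpolates linearly between the two terms. The class and function names are hypothetical.

```python
class CuriosityNormalizer:
    """Tracks the smallest and largest prediction error seen so far."""

    def __init__(self):
        self.min_error = float("inf")
        self.max_error = float("-inf")

    def update(self, error):
        self.min_error = min(self.min_error, error)
        self.max_error = max(self.max_error, error)

    def normalize(self, error):
        """Map an error into [-1, 0]: high error -> near 0, low error -> near -1."""
        span = self.max_error - self.min_error
        if span <= 0.0:
            # Assumption: without any spread in the history, report no surprise.
            return -1.0
        return (error - self.min_error) / span - 1.0


def total_reward(extrinsic, curiosity, eta):
    """Blend rewards; eta = 1 ignores curiosity, eta = 0 ignores the environment."""
    return eta * extrinsic + (1.0 - eta) * curiosity


normalizer = CuriosityNormalizer()
for past_error in (0.0, 2.0, 4.0):  # hypothetical prediction-error history
    normalizer.update(past_error)

curiosity = normalizer.normalize(2.0)
print(curiosity)
print(total_reward(extrinsic=-1.0, curiosity=curiosity, eta=0.5))
```

Keeping only the running minimum and maximum is equivalent to scanning a buffer of all past errors, at constant memory cost.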

3.3 Architecture and Training

We implement the forward model of each hierarchical layer as a multilayer perceptron (MLP) that receives the concatenated current state and action as input and generates a prediction for the successor state as output (cf. Equation 1). For most experiments in this paper (see Section 4), we use an MLP with 3 hidden layers of size 2048 to learn the forward model from the agent’s experiences. Experimentally, we found that this setting yields the best performance, except for the UR5 reacher environment, where we use a hidden size of 128 (see Figure 2(d)). Following Levy et al. [16], we also realize the actor and critic networks as MLPs with 3 hidden layers. The forward model and the actor-critic are trained concurrently with the same learning rate using the Adam optimizer [14]. After each interaction episode, samples are randomly drawn from the replay buffer for training the network parameters of all components, including the forward model. The hyper-parameters were either adopted from HAC [16] or fine-tuned in preliminary experiments.
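To give a feeling for the difference between the two hidden sizes mentioned above, the following sketch counts the trainable parameters of a fully connected network with 3 hidden layers. The state and action dimensions are hypothetical placeholders, and the count assumes plain dense layers with one bias per unit.

```python
def mlp_parameter_count(input_dim, hidden_dim, output_dim, num_hidden_layers=3):
    """Number of weights and biases in a fully connected MLP."""
    sizes = [input_dim] + [hidden_dim] * num_hidden_layers + [output_dim]
    # Each dense layer has fan_in * fan_out weights plus fan_out biases.
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


# Hypothetical dimensions: the forward model maps (state, action) -> next state.
state_dim, action_dim = 30, 8

for hidden_dim in (2048, 128):
    count = mlp_parameter_count(state_dim + action_dim, hidden_dim, state_dim)
    print(hidden_dim, count)
```

Because the hidden-to-hidden layers dominate, the parameter count grows roughly quadratically with the hidden size, which makes the wide forward model far larger than the narrow one.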

4 Experiments

We evaluate the performance of our framework in several goal-based environments with continuous state and action spaces. All environments provide a sparse extrinsic reward when the goal is reached. To evaluate our approach, we record the learning performance in terms of successful rollouts in relation to training rollouts. To this end, we alternate training rollouts (with exploration using $\epsilon$-greedy) and testing rollouts (without exploration) and measure the success rate as the fraction of successful testing rollouts within a testing batch.
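The success-rate metric can be sketched as follows; the function name and the example outcomes are hypothetical, and a testing batch is assumed to be recorded as a list of booleans.

```python
def success_rate(test_outcomes):
    """Fraction of successful testing rollouts in one testing batch."""
    if not test_outcomes:
        raise ValueError("a testing batch needs at least one rollout")
    return sum(1 for succeeded in test_outcomes if succeeded) / len(test_outcomes)


# Hypothetical testing batch: three of four rollouts reached the goal.
print(success_rate([True, False, True, True]))
```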

4.1 Environments

To evaluate our proposal, we run experiments in the following simulated environments:

Figure 2: Simulated environments used in the experiments: (a) Ant Reacher, (b) Ant Four Rooms, (c) Fetch Reacher, (d) UR5 Reacher.

  • Ant reacher: The ant reacher environment (see Figure 2(a)) consists of a four-legged robotic agent that must learn to walk to reach a target location. The action space is based on the joint angles of the limbs, and the observation space consists of the Cartesian locations and velocities of the body parts of the agent. The target location is given as random Cartesian coordinates for the agent’s torso. The yellow and pink spheres in the figure indicate the end-goal and subgoal, respectively.

  • Ant four rooms: This environment is the same as Ant reacher, except that there are walls in the environment that the agent cannot pass (see Figure 2(b)). The walls form four rooms that are connected by passages to transition from one room to another, increasing the difficulty compared to Ant reacher.

  • Fetch robot reacher: This reacher environment (see Figure 2(c)) is based on an inverse kinematics model that provides a 3D continuous action space. The task of the robot is to move the gripper to a target position (indicated in the figure by the black sphere), defined in terms of Cartesian coordinates.

  • UR5 reacher: This environment consists of the first three DoFs (two shoulder joints and one elbow joint) of a UR5 robotic arm that must reach (feasible) random joint configurations indicated by yellow boxes in Figure 2(d). The action space is determined by the angles of the joints, and the state space consists of the joint angles and velocities.

4.2 Results

Figure 3: Learning performance in the four environments: (a) Ant reacher, (b) Ant four rooms, (c) Fetch reacher, (d) UR5 reacher.

The results in Figure 3 reveal significant performance gains in terms of the learning progress for all investigated environments. We first used the ant environments to determine a reasonable value for the hyper-parameter that balances the extrinsic and the curiosity-based reward (cf. Section 3.2). We then used the best-performing value for all reacher environments as well, i.e. Fetch and UR5.

For the ant environments, the success rates peak at around 0.4 (reacher) and 0.28 (four rooms) when using plain HAC with two layers. Adding the curiosity-based reward leads to success rates that are more than twice as high, namely around 0.95 for reacher and 0.7 for four rooms.² Similar performance gains are achieved in the reacher environments: the Fetch reacher remains at a relatively low success rate when using plain HAC, as it cannot achieve success rates higher than 0.05. Using CHAC with the selected hyper-parameter value improves the performance to success rates between 0.2 and 0.3. HAC-based UR5 achieves similarly low success rates (less than 0.05), while the CHAC-based UR5 achieves a better performance, up to 0.12.

² Note that the performance of our HAC implementation in the ant environments is lower than in the original paper by Levy et al. [16] because we used a smaller replay buffer due to memory limitations.

5 Conclusion

Curiosity and the ability to solve problems in a hierarchical manner are two important features of human-level problem-solving and learning. As a novelty and scientific contribution, this paper presents the first computational approach that combines both features by extending hierarchical actor-critic reinforcement learning with a curiosity-enabled reward function. The level of curiosity is modeled by the prediction error of learnable forward models included in all hierarchical layers. Our experimental results provide significant evidence that curiosity improves hierarchical problem-solving. Specifically, using the success rate as the evaluation metric, we show that curiosity more than doubles the learning performance for the proposed hierarchical architecture and benchmark problems.


References

  • [1] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight Experience Replay. In Conference on Neural Information Processing Systems (NIPS), pp. 5048–5058.
  • [2] P. Bacon, J. Harb, and D. Precup (2017) The Option-Critic Architecture. In Thirty-First AAAI Conference on Artificial Intelligence, pp. 1726–1734.
  • [3] M. Botvinick and A. Weinstein (2014) Model-based hierarchical reinforcement learning and human action control. Philosophical Transactions of the Royal Society B: Biological Sciences.
  • [4] M. V. Butz (2016) Toward a Unified Sub-symbolic Computational Theory of Cognition. Frontiers in Psychology 7, pp. 925.
  • [5] C. Colas, P. Fournier, O. Sigaud, M. Chetouani, and P. Oudeyer (2019) CURIOUS: Intrinsically Motivated Modular Multi-Goal Reinforcement Learning. In International Conference on Machine Learning (ICML).
  • [6] M. Eppe, P. D. H. Nguyen, and S. Wermter (2019) From Semantics to Execution: Integrating Action Planning with Reinforcement Learning for Robotic Causal Problem-solving. Frontiers in Robotics and AI 6.
  • [7] S. Forestier and P. Y. Oudeyer (2016) Modular active curiosity-driven discovery of tool use. In IEEE International Conference on Intelligent Robots and Systems, pp. 3965–3972.
  • [8] K. Friston, J. Mattout, and J. Kilner (2011) Action Understanding and Active Inference. Biological Cybernetics 104 (1-2), pp. 137–160.
  • [9] J. Gottlieb and P. Y. Oudeyer (2018) Towards a neuroscience of active sampling and curiosity. Nature Reviews Neuroscience 19 (12), pp. 758–770.
  • [10] M. B. Hafez, C. Weber, and S. Wermter (2017) Curiosity-Driven Exploration Enhances Motor Skills of Continuous Actor-Critic Learner. In International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob), pp. 39–46.
  • [11] T. Hester and P. Stone (2017) Intrinsically motivated model learning for developing curious robots. Artificial Intelligence 247, pp. 170–86.
  • [12] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu (2017) Reinforcement Learning with Unsupervised Auxiliary Tasks. In International Conference on Learning Representations (ICLR).
  • [13] Y. Jiang, S. Gu, K. P. Murphy, and C. Finn (2019) Language as an Abstraction for Hierarchical Deep Reinforcement Learning. In Neural Information Processing Systems (NeurIPS), pp. 9419–9431.
  • [14] D. P. Kingma and J. L. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR).
  • [15] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. B. Tenenbaum (2016) Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. In Conference on Neural Information Processing Systems (NIPS), pp. 3675–3683.
  • [16] A. Levy, G. Konidaris, R. Platt, and K. Saenko (2019) Learning Multi-Level Hierarchies with Hindsight. In International Conference on Learning Representations (ICLR).
  • [17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous Control with Deep Reinforcement Learning. In International Conference on Learning Representations (ICLR).
  • [18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
  • [19] O. Nachum, S. Gu, H. Lee, and S. Levine (2018) Data-Efficient Hierarchical Reinforcement Learning. In Conference on Neural Information Processing Systems (NeurIPS), pp. 3303–3313.
  • [20] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven Exploration by Self-supervised Prediction. In International Conference on Machine Learning (ICML), pp. 2778–2787.
  • [21] G. Pezzulo, F. Rigoli, and K. J. Friston (2018) Hierarchical Active Inference: A Theory of Motivated Control. Vol. 22, Elsevier Ltd.
  • [22] T. Schaul, D. Horgan, K. Gregor, and D. Silver (2015) Universal Value Function Approximators. In International Conference on Machine Learning (ICML), Vol. 37, pp. 1312–1320.
  • [23] G. Schillaci, V. V. Hafner, and B. Lara (2016) Exploration Behaviors, Body Representations, and Simulation Processes for the Development of Cognition in Artificial Agents. Frontiers in Robotics and AI 3, pp. 39.
  • [24] J. Schmidhuber (2010) Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990–2010). IEEE Transactions on Autonomous Mental Development 2 (3), pp. 230–247.
  • [25] D. Silver, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic Policy Gradient Algorithms. In International Conference on Machine Learning (ICML), pp. 387–395.
  • [26] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu (2017) FeUdal Networks for Hierarchical Reinforcement Learning.
  • [27] N. Watters, L. Matthey, M. Bosnjak, C. P. Burgess, and A. Lerchner (2019) COBRA: Data-Efficient Model-Based RL through Unsupervised Object Discovery and Curiosity-Driven Exploration. In International Conference on Machine Learning (ICML).