From semantics to execution: Integrating action planning with reinforcement learning for robotic tool use

by   Manfred Eppe, et al.

Reinforcement learning is an appropriate and successful method to robustly perform low-level robot control under noisy conditions. Symbolic action planning is useful to resolve causal dependencies and to break a causally complex problem down into a sequence of simpler high-level actions. A problem with the integration of both approaches is that action planning is based on discrete high-level action- and state spaces, whereas reinforcement learning is usually driven by a continuous reward function. However, recent advances in reinforcement learning, specifically, universal value function approximators and hindsight experience replay, have focused on goal-independent methods based on sparse rewards. In this article, we build on these novel methods to facilitate the integration of action planning with reinforcement learning by exploiting the reward-sparsity as a bridge between the high-level and low-level state- and control spaces. As a result, we demonstrate that the integrated neuro-symbolic method is able to solve object manipulation problems that involve tool use and non-trivial causal dependencies under noisy conditions, exploiting both data and knowledge.



There are no comments yet.


page 1

page 3

page 6

page 12

page 13


Compositional Reinforcement Learning from Logical Specifications

We study the problem of learning control policies for complex tasks give...

Interactive Hierarchical Guidance using Language

Reinforcement learning has been successful in many tasks ranging from ro...

Transferable Task Execution from Pixels through Deep Planning Domain Learning

While robots can learn models to solve many manipulation tasks from raw ...

Learning Robotic Manipulation Tasks through Visual Planning

Multi-step manipulation tasks in unstructured environments are extremely...

Continuous Motion Planning with Temporal Logic Specifications using Deep Neural Networks

In this paper, we propose a model-free reinforcement learning method to ...

Lidar-based exploration and discretization for mobile robot planning

In robotic applications, the control, and actuation deal with a continuo...

Levels of Integration between Low-Level Reasoning and Task Planning

We provide a systematic analysis of levels of integration between discre...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Keywords:

reinforcement learning, hierarchical architecture, planning, robotics, neural networks, causal puzzles

2 Introduction

How can realize robots that reason about complex physical object manipulation problems, and how can we integrate this reasoning with the noisy sensorimotor machinery that executes the required actions in a continuous low-level action space? To address these research questions, we consider reinforcement learning (RL) as it is a successful method to facilitate low-level robot control (Deisenroth2011_PILCO). However, it is well known that non-hierarchical reinforcement-learning architectures fail in situations involving non-trivial causal dependencies that require the reasoning over an extended time horizon (Mnih2015). For example, the robot in Fig. 1 (right) needs to first grasp the rake before it can be used to drag the block to a target location. Such a problem is hard to solve by RL-based low-level motion planning without any high-level method that subdivides the problem into smaller sub-tasks.

To this end, recent research has developed hierarchical and model-based reinforcement learning methods to tackle problems that require reasoning over a long time horizon, as demanded in the domains of robotic tool use, block stacking (Deisenroth2011_PILCO), and computer games (Aytar2018_DQN_Atari_Youtube; Pohlen2018_Atari_Consistent). Yet, the problem of realizing an agent that learns to solve open-domain continuous space causal puzzles from scratch, without learning from demonstration or other data sources, remains unsolved. The existing learning-based approaches are either very constrained (e.g. (Deisenroth2011_PILCO)), or they have been applied only to low-dimensional non-noisy control problems that do not involve complex causal dependencies (e.g. (Bacon2016_OptionCritic_HRL; Levy2019_Hierarchical)), or they build on learning from demonstration (Aytar2018_DQN_Atari_Youtube).

A complementary method to address complex causal dependencies is to use pre-specified semantic domain knowledge, e.g. in the form of an action planning domain description (Mcdermott1998_PDDL). With an appropriate problem description, a planner can provide a sequence of solvable sub-tasks in a discrete high-level action space. However, the problem with this semantic and symbolic task planning approach is that the high-level actions generated by the planner require the grounding in a low-level motion execution layer to consider the context of the current low-level state. For example, executing a high-level robotic action move_object_to_target requires precise information (e.g. the location and the shape) of the object to move it to the target location along a context-specific path. This grounding problem consists of two sub-problems 1 and 2 that we tackle in this article.

  1. [leftmargin=2.5em,labelsep=1em,label=P.0]

  2. The first grounding sub-problem is to map the discrete symbolic action space to context-specific subgoals in the continuous state space. For instance, move_object_to_target needs to be associated with a continuous sub-goal that specifies the desired metric target coordinates of the object.

  3. The second grounding sub-problem is to map the subgoals to low-level context-specific action trajectories. For instance, the low-level trajectory for move_object_to_target is specific to the continuous start- and target location of the object, and to potential obstacles between start and target.

In this research, we address 1 by providing a simple, yet principled, formalism to map the propositional high-level state space to continuous subgoals in Sec. 4.1. We address 2 by integrating this formalism with goal-independent reinforcement learning based on sparse rewards in Sec. 4.2.

Figure 1: A robot performing two object manipulation tasks. 1. Block stacking (left): The gripper must stack three blocks at a random location within the robot’s range on the table (indicated by the transparent goal markers behind the gripper). Herein, the robot needs to subdivide the course of actions into separate actions for grasping and placing the individual blocks. 2. Tool use (right): The red block is out of the gripper’s range (indicated by the dark brown ellipsoid), so that solving the task of moving the block to a target location requires the robot to break the problem down into a sequence of high-level actions that involve grasping the rake, moving the rake towards the block and pulling the rake.

Existing approaches that integrate action planning with reinforcement learning have not been able to map subgoals to low-level motion trajectories for realistic continuous-space robotic applications (Grounds2005_Plan-Q; Ma2009_PolicySearch_Planning)

because they rely on a continuous dense reward signal that is proportional to a manually defined metrics that estimates how well a problem has been solved 

(Ng1999_RewardShaping). The manual definition of such metrics, also known as reward shaping, is a non-trivial problem itself because the semantic distance to a continuous goal is often not proportional to the metric distance.

Recently, so-called universal value function approximators (UVFAs) (Schaul2015_UVFA) in combination with hindsight experience replay (HER) (Andrychowicz2017HindsightReplay) and neural actor-critic reinforcement learning methods (Lillicrap2016_DDPG) have been proposed to alleviate this issue. HER realizes an efficient off-policy algorithm that allows for non-continuous sparse rewards without relying on reward shaping. Specifically, HER treats action trajectories as successful that do not achieve the desired specified goal, by pretending in hindsight that the achieved state was the desired goal state. Our research builds on this method because the sparse subgoal-specific rewards allow us to decouple the reward mechanism from the high-level action planning.

This approach enables us to address the following central hypotheses:

  1. [leftmargin=2.5em,labelsep=1em,label=H.0]

  2. We hypothesize that reinforcement learning with sparse rewards is appropriate to learn the grounding of a discrete symbolic action space in continuous action trajectories.

  3. We hypothesize that the resulting hybrid architecture is capable of solving complex causal object-manipulation puzzles that involve tool use and non-trivial causal dependencies.

  4. We hypothesize that the approach is robust enough to consider a realistic amount of noise in the robot’s observation and action execution.

We address these hypotheses by developing two simulated robotic benchmark environments that are based on the OpenAI Gym framework. For these environments, we provide manually defined action planning domain descriptions, and combine a planner with a reinforcement learner to learn the grounding of high-level action descriptions in low-level trajectories.

Our research contribution is a principled method and proof-of-concept to ground high-level semantic actions in low-level sensorimotor motion trajectories and to integrate reinforcement learning with symbolic action planning.

Herein, we do not focus on achieving good benchmark results in comparison with other approaches. An empirical comparison is also not straight-forward, because most existing approaches are either limited to discrete action- and state spaces or they rely on other non-trivial forms of domain-knowledge. Our research goal is rather to provide a proof-of-concept and a baseline for the integration of action planning with reinforcement learning in continuous domains that involve complex causal dependencies.

The remainder of the article is organized as follows. In Sec. 3 we investigate the state of the art in task and motion planning, hierarchical learning and the integration of planning with learning. We identify the problem of grounding high-level actions in low-level trajectories as a critical issue for robots to solve causal puzzles. We present our method and the underlying background in Sec. 4. We describe the realization of our experiments in Sec. 5, show the experimental results in Sec. 6, before we discuss and align our findings with the hypotheses 1-3 in Sec. 7. We conclude in Sec. 8.

3 State of the art

Our work is related to robotic task and motion planning, but it also addresses plan execution. Therefore, it is also related to hierarchical learning algorithms and the integration of learning with planning.

3.1 Combined task and motion planning

The field of combined task and motion planning (TAMP) investigates methods to integrate the low-level motion planning with high-level task planning. The field aims at solving physical puzzles and problems that are too complex to solve with motion planning alone, often inspired by smart animal behavior (Toussaint2018_DiffPhysics_ToolUse). For example, crows are able to perform a sequence of high-level actions, using tools like sticks, hooks or strings, to solve a puzzle that eventually leads to a reward (Taylor2009). A set of related benchmark problems has recently been proposed by Lagriffoul2018_TAMP_Benchmarks. However, since TAMP focuses primarily on the planning aspects and not necessarily on the online action execution, the benchmark environments do not consider a physical action execution layer.

Toussaint2018_DiffPhysics_ToolUse formulate the TAMP problem as an inverted differentiable physics simulator. The authors consider the local optima of the possible physical interactions by extending mixed-integer programs (MIP) (Deits2014_MIP) to first-order logic. The authors define physical interactions as action primitives that are grounded in contact switches. The space of possible interactions is restricted to consider only those interactions that are useful for the specific problem to solve. These interactions are formulated based on a fixed set of predicates and action primitives in the domain of robotic tool use and object manipulation. However, the authors provide only a theoretical framework for planning, and they do not consider the physical execution of actions. Therefore, an empirical evaluation to measure the actual performance of their framework, considering also real-world issues like sensorimotor noise, is not possible.

Other TAMP approaches include the work by Alili2010_SymGeoReasoning and DeSilva2013_HTN_GeoPlanning who both combine a hierarchical symbolic reasoner with a geometrical reasoner to plan human-robot handovers of objects. Both approaches consider only the planning, not the actual execution of the actions. The authors do not provide an empirical evaluation in a simulated or real environment. Srivastava2014_TAMP_Planner_Interface also consider action execution and address the problem of grounding high-level tasks in low-level motion trajectories by proposing a planner-independent interface layer for TAMP that builds on symbolic references to continuous values. Specifically, they propose to define symbolic actions and predicates such that they refer to certain objects and their poses. They leave it to the low-level motion planner to resolve the references. Their approach scales well on the planning level in very cluttered scenes, i.e., the authors demonstrate that the planning approach can solve problems with 40 objects. The authors also present a physical demonstrator using a PR2 robot, but they do not provide a principled empirical evaluation to measure the success of the action execution under realistic or simulated physical conditions. Wang2018_ModelLearningSampling_TAMP also consider action execution, though only in a simple 2D environment without realistic physics. Their focus is on solving long-horizon task planning problems that involve sequences of 10 or more action primitives. To this end, the authors present a method that learns the conditions and effects of high-level action operators in a kitchen environment.

A realistic model that also considers physical execution has been proposed by Leidner2018. The authors build on geometric models and a particle distribution model to plan goal-oriented wiping motions. Their architecture involves low-level and high-level inference, and a physical robotic demonstrator. However, the authors build on geometric modeling and knowledge only, without providing a learning component. Noisy sensing is also not addressed in their work.

3.2 Hierarchical learning-based approaches

Most TAMP approaches consider the planning as an offline process given geometrical, algebraic or logical domain models. This, however, does not necessarily imply the consideration of action execution. The consideration of only the planning problem under idealized conditions is not appropriate in practical robotic applications that often suffer from sensorimotor noise. To this end, researchers have investigated hierarchical learning-based approaches that differ conceptually from our work because they build on data instead of domain-knowledge to realize the high-level control framework.

For example, Levy2019_Hierarchical and Nachum2018_HIRO both consider ant-maze problems in a continuous state- and action space. The challenge of these problems lies in coordinating the low-level walking behavior of the ant-like agent with the high-level navigation. However, in their ant-maze setup there exist no causally complex dependencies other than the relatively simple ones involved in the high-level navigation.

Other existing hierarchical learning-based approaches are limited to discrete action- or state spaces. For example, Kulkarni2016 presents the h-DQN framework to integrate hierarchical action-value functions with goal-driven intrinsically motivated deep RL. Here, the bottom-level reward needs to be hand-crafted using prior knowledge of applications. Vezhnevets2017_Feudal introduce the FeUdal Networks (FuNs), a two-layer hierarchical agent. The authors train subgoal embeddings to achieve a significant performance in the context of Atari games with a discrete action space. Another example of an approach that builds on discrete actions is the option-critic architecture by Bacon2016_OptionCritic_HRL. Their method extends gradient computations of intra-option policies and termination functions to enable learning options that maximize the expected return within the option framework, proposed by sutton_between_1999. The authors apply and evaluate their framework in the Atari gaming domain.

3.3 Integrating learning and planning

There exist several approaches that exploit domain knowledge and avoid a purely data-driven approach as in hierarchical learning. These approaches build on the same principle as ours, namely on the integration of reinforcement learning with action planning.

For example, the Dyna Architecture (Sutton1991_dyna) and its derived methods, e.g. Dyna-Q (sutton_integrated_1990), queue-Dyna (peng_efficient_1993), RTP-Q (zhao_rtp-q:_1999), aim to speed up the learning procedure of the agent by unifying reinforcement learning and incremental planning within a single framework. While the RL component aims to construct the action model as well as to improve the value function and policy directly through real experiences from environment interaction, the planning component updates the value function with simulated experiences collected from the action model. The authors show that instead of selecting uniformly experienced state-action pairs during the planning, it is much more efficient to focus on pairs leading to the goal state (or nearby states) because these cause larger changes in value function. This is the main idea of the prioritized sweeping method (Moore1993PrioritizedTime) and derivatives (Andre1998GeneralizedSweeping). The methods based on Dyna and prioritized sweeping have not been demonstrated to address sparse rewards, which are required to map discrete high-level actions and states to low-level subgoal states in a scalable manner.

Ma2009_PolicySearch_Planning present the policy search planning method, in which they extend the policy search GPOMDP (Baxter2001Infinite-horizonEstimation) towards the multi-agent domain of robotic soccer. Herein, they map symbolic plans to policies using an expert knowledge database. The approach does not consider tool use or similar causally complex problems. A similar restriction pertains to PLANQ-learning framework (Grounds2005_Plan-Q): The authors combine a Q-learner with a high-level STRIPS planner (Fikes1972_STRIPS), where the symbolic planner shapes the reward function to guide the learners to the desired policy. First, the planner generates a sequence of operators to solve the problem from the problem description. Then, each of these operators is learned successively by the corresponding Q-learner. This discrete learning approach, however, has not been demonstrated to be applicable beyond toy problems, such as the grid world domain that the authors utilize for demonstrations in their paper. Yamamoto2018_Learning_AbductivePlanning propose a hierarchical architecture that uses an abduction-based planner at high-level to generate subgoals for the low-level on-policy reinforcement learning component, which employed the proximal policy optimization (PPO) algorithm (Schulman2017ProximalAlgorithms). This approach requires the introduction of an additional task-specific evaluation function, alongside the basic evaluation function of the abduction model, that allows the planner to provide the learner the intrinsic rewards, similar to (Kulkarni2016). The evaluation is only conducted in a grid-based virtual world where an agent has to pick up materials, craft objects and reach a goal position.

A very interesting approach has been presented by Ugur2015. The authors integrate learning with planning, but instead of manually defining the planning domain description they learn it from observations. To this end, they perform clustering mechanisms to categorize object affordances and high-level effect of actions, which are immediately employed in a planning domain description. The authors demonstrate their work on a physical robot. In contrast to our work, however, the authors focus mostly on the high-level inference and not on the robustness that low-level reinforcement-based architectures provide.

4 Integrating reinforcement learning with action planning

Figure 2: Our proposed integration model. Low-level motion planning elements are indicated in green and high-level elements in orange color. The abstraction functions map the low-level state and goal representations to high-level state and goal representations . These are given as input to the planner to compute a high-level subgoal state . The subgoal grounding function maps to a low-level subgoal under consideration of the context provided by the current low-level state and the low-level goal . The reinforcement learner learns to produce a low-level motion plan that consists of actions based on the low-level subgoal and the low-level state .

To tackle the problems 1 and 2, and to address the research goal of grounding high-level actions in low-level control trajectories, we develop the architecture depicted in Fig. 2. We formalize the abstraction of the low-level space to the high-level space, the planner and the grounding of the high-level representations to low-level subgoals in Sec. 4.1. We present the adopted reinforcement learner in Sec. 4.2, and describe the integration of both methods in Sec. 4.3.

4.1 Abstraction and grounding of states and goals

Our abstraction and grounding mechanism tackles the research problem 1, i.e, the mapping from high-level actions to low-level subgoals. STRIPS-based action descriptions are defined in terms of state changes based on predicates for pre- and postconditions. To appreciate that the state change is determined by the postcondition predicates, and not by the actions themselves, it is more succinct to define subgoals in terms of postcondition predicates because multiple actions may involve the same postconditions. Therefore, we define a grounding function for subgoals . The function is based on predicates instead of actions to avoid redundancy and to minimize the hand-engineering of domain models and background knowledge.

To abstract from low-level perception to high-level world states, we define abstraction functions . These functions do not require any additional background knowledge because they fully exploit the definition of . In our formalization of the abstraction and grounding, we consider the following conventions and assumptions 1-7:

  1. [leftmargin=2.5em,labelsep=1em,label=C.0]

  2. The low-level environment state space

    is fully observable, but observations may be noisy. We represent low-level environment states with finite-dimensional vectors


  3. The low-level action space is determined by continuous finite-dimensional vectors . For example, the experiments described in this paper consider a four-dimensional action space that consists of the normalized relative three-dimensional movement of the robot’s gripper in Cartesian coordinates plus a scalar value to represent the opening angle of the gripper’s fingers.

  4. Each predicate of the planning domain description determines a Boolean property of one object in the environment. The set of all predicates is denoted as . The high-level world state is defined as the conjunction of all positive or negated predicates.

  5. The environment configuration is fully determined by a set of objects whose properties can be described by the set of high-level predicates . Each predicate can be mapped to a continuous finite-dimensional low-level substate vector . For example, the location property and the velocity property of an object in Cartesian space are both fully determined by a three-dimensional continuous vector.

  6. For each predicate , there exists a sequence of indices that determines the indices of the low-level environment state vector that determines the property described by . For example, given that refers to the object being at a specific location, and given that the first three values of determine the Cartesian location of an object, we have that . A requirement is that the indices of the individual predicates must not overlap, i.e., abusing set notation: (see Algorithm 1 for details).

  7. The high-level action space consists of a set of grounded STRIPS operators (Fikes1972_STRIPS) that are determined by a conjunction of precondition literals and a conjunction of effect literals.

  8. A low-level goal is the subset of the low-level state , indicated by the indices , i.e., . For example, consider that the low-level state is a six-dimensional vector where the first three elements represent the location and the last three elements represent the velocity of an object in Cartesian space. Then, given that , we have that refers to the location of the object.

4.1.1 Grounding high-level representations in low-level subgoals

Abstracting from observations to high-level predicate specifications is achieved by mapping the low-level observation space to a high-level conceptual space. This is realized with a set of functions that we define manually for each predicate . For a given predicate , the function generates the low-level substate that determines , based on the current state and the goal :


The function that maps the high-level subgoal state in the context of to a a low-level subgoal builds on Eq. (1), as described with the following Algorithm 1:

3:     subgoal_changed True
4:     while subgoal_changed do
6:         for  do For each predicate in high-level goal state, set low-level subgoal indices
7:              , where          
8:         subgoal_changed      
Algorithm 1 Mapping propositional state representations to continuous state representations

The while loop is necessary to prevent the situation where the application of (in line 7, Algorithm 1) changes in a manner that affects a previous predicate subgoal function. For example, consider the two predicates (at_target o1) and (on o2 o1). The predicate (at_target o1) determines the Cartesian location of o1, and (on o2 o1) depends on these coordinates. Therefore, it may happen that causes the first three elements of to be . However, depends on these to determine the that encode the location of o2 in . The while loop assures that is applied at least once after to consider this dependency. This assures that all dependencies between the elements of are resolved.

To guarantee the termination of the Algorithm 1, i.e., to avoid that the alternating changes of cause an infinite loop, the indices must be constrained in such a way that they do not overlap (see assumption 5).

4.1.2 Abstracting from low-level state- and goal representations to propositional statements

To realize the abstraction from low-level to high-level representations, we define a set of functions in the form of Eq. 2. Specifically, for each predicate , we define the following function maps the current low-level state and the low-level goal to the predicates’ truth values.


Eq. 2 examines whether the subgoal that corresponds to a specific predicate is true, given the current observation and a threshold value for the coordinates . In this article, but without any loss of generality, we assume that each predicate is determined by three coordinates. Eq. 2 computes the difference between these coordinates given the current state, and the coordinates determined by as their Euclidean distance. For example, may define target coordinates for o1, and Eq. 2 considers the distance between the target coordinates and the current coordinates.

4.2 Generation of adaptive low-level action trajectories

To address the research problem 2, i.e, the grounding of actions and subgoals in low-level action trajectories, we consider continuous goal-independent reinforcement learning approaches (Lillicrap2016_DDPG; Schaul2015_UVFA). Most reinforcement learning approaches build on manually defined reward functions based on a metric that is specific to a single global goal, such as the body height, posture and forward speed of a robot that learns to walk (Schulman2015a). Goal-independent reinforcement learning settings do not require such reward shaping (c.f. (Ng1999_RewardShaping)), as they allow one to parameterize learned policies and value functions with goals. We employ the actor-critic deep deterministic policy gradient (DDPG) (Lillicrap2016_DDPG) approach in combination with hindsight experience replay (HER) (Andrychowicz2017HindsightReplay) to realize the goal-independent RL part in our framework. Using the HER technique with off-policy reinforcement learning algorithms (like DDPG) increases the efficiency of sampling for our approach since HER stores not only the experienced episode with the original goal () in the replay buffer. It also stores modified versions of an episode where the goal is retrospectively set to a state that has been achieved during the episode, i.e., with for some .

To realize this actor-critic architecture, we provide one fully connected neural network for the actor , determined by the parameters and one fully connected neural network for the critic , determined by the parameters . The input to both networks is the concatenation of the low-level state and the low-level subgoal . The optimization criterion for the actor is to minimize the value provided by the critic. The optimization criterion for the critic is to minimize the mean squared error between the critic’s output and the discounted expected reward according to the Bellmann equation for deterministic policies, as described by Eq. 3.


Given that the action space is continuous -dimensional, the observation space is continuous -dimensional, and the goal space is continuous -dimensional with , the following holds for our theoretical framework: At each step , the agent executes an action given an observation and a goal , according to a policy

that maps from the current observation and goal to a probability distribution over actions. An action generates a reward

if the goal is achieved at time , otherwise the reward is . To decide whether a goal has been achieved, a function is defined that maps the observation space to the goal space, and the goal is considered to be achieved if for a small threshold . This sharp distinction of whether or not a goal has been achieved based on a distance threshold causes the reward to be sparse and renders shaping the reward with a hand-coded reward function unnecessary.

4.3 Integration of high-level planning with reinforcement learning

Our architecture integrates the high-level planner and the low-level reinforcement learner as depicted in Fig. 2. The input to our framework is a low-level goal . The sensor data that represents the environment state is abstracted together with to a propositional high-level description of the world state and goal . An action planner based on the planning domain definition language (PDDL) Mcdermott1998_PDDL takes these high-level representations as input and computes a high-level plan based on manually defined action definitions (c.f. the Appendix in Sec. 9.1 for examples). We have implemented the caching of plans to accelerate the runtime performance. The high-level subgoal state is the expected successor state of the current state given that the first action of the plan is executed. This successor state is used as a basis to compute the next subgoal. To this end, is processed by the subgoal grounding function (Algorithm 1) that generates a subgoal as input to the low-level reinforcement learner.

5 Experiments

This section describes two designed experiments for evaluation of the proposed approach that we refer to as block stacking and tool use. Both experiments are conducted with a Fetch robot arm in the Mujoco simulation environment (Todorov2012MuJoCo:Control)

. We use a three-layer fully connected neural network with the rectified linear unit (ReLU) activation function to represent the actor-critic network of the reinforcement learner in both experiments. The learning rate was set to 0.01, and the parameters were updated using a parallelized version of the Adam optimizer 

(Kingma2015_Adam). We used a reward discount of , where is the number of low-level actions per rollout. For the block stacking we used 50, 100 and 150 low-level actions for the case of one, two and three blocks respectively. For the tool use experiment we used 100 low-level actions.

Preliminary hyperoptimization experiments showed that the optimal number of units for each layer of the neural networks for actor and critic of the reinforcement learning depends on the observation space. Therefore, we implement the network architecture such that the number of units in each layer scales with the size of the observation vector. Specifically, the layers in the actor and critic consist of 12 units per element in the observation vector. For example, for the case of the block stacking experiment with two blocks, this results in 336 neural units per layer (see Sec. 5.1). We apply the same training strategy of HER (Andrychowicz2017HindsightReplay), evaluate periodically learned policies during training without action noise and use early stopping at between 80% and 90% success rate, depending on the task.

To evaluate the robustness to observational noise, we consider the amount of noise relative to the value range of the observation vector. To this end, we approximate the continuous-valued state range, denoted

, as the difference between the upper and lower quartile of the elements in the history of the last 5000 continuous-valued state vectors that were generated during the rollouts.


We use the upper and lower quartile and not the minimum and maximum of all elements to eliminate outliers.

For each observation step, we randomly sample noise, denoted

, according to a normal distribution with

being the standard deviation. To parameterize the amount of noise added to the observation, we define a noise-to-signal-ratio

such that the noise added to the state vector is computed as . We refer to the noise level as the percentage corresponding to . That is, e.g., is equivalent to a noise level of 1%.

We trained the agent on multiple CPUs in parallel and averaged the neural network weights of all CPU instances after each epoch, as described by

Andrychowicz2017HindsightReplay. An epoch consists of 100 training rollouts per CPU, followed by training the neural networks for actor and critic with 15 batches after each epoch, using a batch size of 256. For the experiment of stacking three blocks we used 26 CPUs, and for all other experiments we used 15 CPUs. The results in Fig. 3 and Fig. 5 illustrate the median and the upper and lower quartile over multiple repetitions of each experiment.

5.1 Block stacking

Fig. 1 (left) presents the simulated environment for this experiment, where a number of blocks (i.e. up to three) are placed randomly on the table. The task of the robot is to learn how to reach, grasp and stack those blocks one-by-one to their corresponding random target location. The task is considered completed when robot successfully places the last block on top of the others and moves its gripper to another random target location. The difficulty of this task increases with the number of blocks to stack.

The size of the goal space depends on the number of blocks. For this experiment, the goal space is a subset of the state space that is constituted by the three Cartesian coordinates of the robot’s gripper and three coordinates for each block. That is, the dimension of the goal- and subgoal space is , where is the number of objects.

The state space of the reinforcement learning agent consists of the Cartesian location and velocity of the robot’s gripper, the gripper’s opening angle, and the Cartesian location, rotation, velocity, and rotational velocity of each object. That is, the size of the state vector is , where is the number of blocks.

For the experiments with one and two blocks, we consider problems where the final location of the gripper is randomly determined. For the experiment with three blocks, we had to simplify this aspect and placed the final location of the gripper immediately above the last block that it has stacked, otherwise the algorithm did not provide acceptable results in reasonable time. The planning domain description for this block stacking environment is implemented with the PDDL actions and predicates provided in the Appendix (in Sec. 9.1.1).

5.2 Tool use

The environment utilized for this experiment is shown in the Fig. 1 (right). A single block is placed randomly on the table, such that it is outside the reachable region of the Fetch robot. The robot has to move the block to a random target position within the reachable (dark brown) region on the table surface. In order to do so, the robot has to learn how to use the provided rake, which is reachable by its gripper. It can drag the block either with the left or the right edge of the rake. The observation space consists of the Cartesian velocities, rotations and locations of the robot’s gripper, the tip the rake, and the object. An additional approximation of the end of the rake is added in this task. The goal space only contains of the Cartesian coordinates of the robot’s gripper, the tip of the rake, and the object. The planning domain description for this tool use environment is implemented with the PDDL actions and predicates provided in the Appendix (in Sec. 9.1.2).

6 Results

To evaluate our approach, we investigate the success rate of the testing phase over time for all experiments, given varying noise levels . The success rate is computed per epoch, by averaging over the number of successful rollouts per total rollouts over ten problem instances per CPU.

6.1 Block stacking

Figure 3: Results of the block stacking experiments for one (top), two (middle) and three (bottom) blocks. We added sensory noise for the case of one and two blocks.

For the experiment with one block, the approach converges after around ten epochs, even with , i.e., if up to 6% of noise is added to the observations. The results are significantly worse for 8% and more noise. For the experiment with two blocks, the performance drops already for 4% noise. Interestingly, for both one and two blocks, the convergence is slightly faster when 1% of noise is added, compared to no noise at all. However, we were not able to conduct experiments with noise for that case of stacking three blocks.

For this latter case of learning to stack three blocks consider also Fig. 4, which shows how many subgoals have been achieved in average during each epoch. Ideally, six high-level actions, and hence six sugboals, are required to solve the task: [move_gripper_to(o1), move_to_target(o1), move_gripper_to(o2), move_o_on_o(o2,o1), move_gripper_to(o3), move_o_on_o(o3,o2)]. The result shows that the agent can learn to achieve the first five subgoals in average, but is not able to proceed further. This demonstrates that the agent robustly learns to stack the first two objects, but fails to stack also the third one.

Figure 4: Number of subgoals reached for the case of stacking three blocks.

6.2 Tool use

Figure 5: Performance curve of the tool use experiment under different noise levels.

The results in Fig. 5 reveal that our proposed approach allows the agent to learn and complete the task in under 100 training epochs (corresponding to approximately 8 hours with 15 CPUs) even with a noise level increased up to 4% of its range. We observe that it becomes harder for agent to learn when the noise level exceeds 4%. In the case of 8% noise, the learning fails to achieve a reasonable performance in the considered time–first 100 training epochs (i.e. it only obtains less than 1% success rate). Interestingly, in cases with very low noise levels (1%-2%), the learning performance is better or at least as good as the cases of no noise added at all.

7 Discussion

The results indicate that our proof-of-concept addresses the hypotheses 1-3 as follows:

7.1 Hypothesis H.1: Ability to ground high-level actions in low-level trajectories

Our experiments indicate that the grounding of high-level actions in low-level RL-based robot control using the HER approach performs well for small to medium-sized subgoal spaces. However, the learning is not completely automated, as the approach requires the manual definition of the planning domain and of the functions that map planning domain predicates to subgoals.

For the tasks of stacking two blocks and the tool use, the subgoal space involved nine values, and both tasks could be learned successfully. The qualitative evaluation and visual inspection of the agent in the rendered simulation revealed that the grasping of the first and second block failed more often for the experiment of stacking three blocks than for the experiment of stacking two blocks. Therefore, we conclude that the subgoal space for stacking three blocks, which involves twelve values, is too large.

However, the performance on the control-level was very robust. For example, it happened frequently during the training and exploration phase that the random noise in the actions caused a block to slip out of the robot’s grip. In these cases, the agent was able to catch the blocks immediately while they were falling down. During the tool use experiment, the agent was also able to consider the rotation of the rake, to grasp the rake at different positions, and to adapt its grip it when it was slipping.

7.2 Hypothesis H.2: Ability to solve causal puzzles

The results indicate that the approach is able to solve causal puzzles if the subgoal space is not too large. The architecture depends strongly on the planning domain representation that needs to be implemented manually. The manual domain-engineering that is required for the planning is appropriate for tasks that are executed frequently, such as adaptive robotic co-working at a production line or in a drone delivery domain. Due to the caching of plans (see Sec. 4.3), we have not encountered issues with the computational complexity problem and runtime issues of the planning approach.

To investigate the performance gain provided by the causal reasoning abilities of the high-level planning layer, we have compared our approach to a non-hierarchical reinforcement learner (see Sec. 9.2, Fig. 6). The non-hierarchical approach was not able to learn to solve any of the tasks that we discuss in this paper. Therefore, we investigated a simplified version of the tool use task, which only involves pulling the rake to the target without involving any other objects. For this simplified task, our planning-based approach learned approximately ten times faster than the shallow reinforcement learner.

7.3 Hypothesis H.3: Robustness to noise

For all experiments, except the stacking of three blocks, the approach converged with a reasonable noise-to-signal ratio of four to six percent. An interesting observation is that a very low amount of noise, i.e. , improves the learning performance for most cases. We hypothesize that the noise has the effect of generalizing the input data for the neural network training, such that the parameters become more robust.

Noise is an important issue for real physical robots. Our results indicate that the algorithm is potentially appropriate for physical robots, at least for the case of grasping and moving a single block to a target location. For this case, with a realistic level of noise, the algorithm converged after approximately ten epochs (see Fig. 3). Per epoch and CPU, the agent conducts 100 training rollouts. A single rollout would take around 20 seconds on a physical robot. Considering that we used 15 CPUs, the equivalent robot training time required is approximately 83 hours. This value is reasonable for real applications. The physical training time can potentially further be lowered by applying more neural network training batches per rollout, and by performing pre-training using the simulation.

8 Conclusion

We have developed a hierarchical architecture for robotic applications in which agents must perform reasoning over a non-trivial causal chain of actions. We have employed a PDDL planner for the high level planning and we have integrated it with an off-policy reinforcement learner to enable robust low-level control.

The innovative novelty of our approach is the combination of action planning with goal-independent reinforcement learning and sparse rewards (Lillicrap2016_DDPG; Andrychowicz2017HindsightReplay). This integration allowed us to address two research problems that involve the grounding of the discrete high-level state and action space in sparse rewards for low-level reinforcement learning. We addressed the problem of grounding of symbolic state spaces in continuous-state subgoals (1), by proposing a principled predicate-subgoal mapping. This allowed us to address also the problem of mapping subgoals to low-level action trajectories (2) by means of reinforcement learning with sparse rewards using hindsight experience replay (Andrychowicz2017HindsightReplay). As a result, our approach has two advantages over other methods that combine action planning and reinforcement learning, e.g. (Grounds2005_Plan-Q; Yamamoto2018_Learning_AbductivePlanning): The low-level action space for our robotic application is continuous and it supports a higher dimensionality.

We have realized and evaluated our architecture in simulation, and we addressed three hypotheses (1-3): First, we demonstrate that the approach can successfully integrate high-level planning with reinforcement learning (1); second, we show that this makes it possible to solve simple causal puzzles (2), and, third, we demonstrate robustness to a realsitic level of sensory noise (3). The latter demonstrates that our approach is potentially applicable to real-world robotic applications.

The causal puzzles that we investigate in this paper are also relevant for hierarchical reinforcement learning (HRL) (e.g. (Levy2019_Hierarchical)), but we have not been able to identify an article that presents good results in problem-solving tasks that have a causal complexity comparable to our experiments. An empirical comparison was, therefore, not directly possible. Our approach has the advantage over HRL that it exploits domain knowledge in the form of planning domain representations. The disadvantage compared to HRL is that the domain knowledge must be hand-engineered. In future work, we plan to complement both approaches, e.g. by building on vector-embeddings to learn symbolic planning domain descriptions from scratch by means of the reward signal of the reinforcement learning. A similar approach, based on clustering of affordances, has been presented by Ugur2015, and complementing their method with the robustness of the reinforcement learner suggests significant potential. We also plan to apply the approach to a physical robot, and to reduce the amount of physical training time by pre-training the agent in our simulation and by applying domain-randomization techniques (Andrychowicz2018_Dexterity).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Author Contributions

  • [leftmargin=2.5em,labelsep=1em]

  • Manfred Eppe has developed the overall concept and theoretical foundation of the method. He also implemented, developed and evaluated the computational framework together with Phuong D.H. Nguyen. Manfred Eppe has authored the core parts of Sec. 2 and Sec. 4, and has contributed significantly to all other sections.

  • Phuong D.H. Nguyen has researched and investigated significant parts of the related work in Sec. 3, and he also worked on further refinements and adaptions of the theoretical background provided in Sec. 4 of this article. He has revised and contributed also to all other sections of this article. Furthermore, Phuong D.H. Nguyen has developed the tool use experiment.

  • Stefan Wermter has actively contributed and revised the article.


Manfred Eppe and Stefan Wermter acknowledge funding by the Experiment! Programme of the Volkswagen Stiftung. Furthermore, this project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 642667 (SECURE).


We thank Erik Strahl for the technical supports with the computing servers and Fares Abawi for the initial version of the Mujoco simulation for the experiments.


9 Appendix

9.1 PDDL domain descriptions

For all PDDL definitions, we used only the STRIPS semantic requirement, and we realized the quantification operators by grounding the variables manually.

9.1.1 Block stacking

The domain description of the block stacking task is described in the following Listing 1.

(define (domain block)
    (:objects o1 ... on)
        (gripper_at ?o)
        (at_target ?o)
        (on ?o1 ?o2)
    (:action move_gripper_to_o
        :parameters (?o)
        :precondition ()
        :effect (and (gripper_at ?o) (forall ?o1 != ?o: (not (gripper_at ?o1))  (not (on ?o1 ?o)) (not (gripper_at_target))))
    (:action move_o_to_target
        :parameters (?o)
        :precondition (and (gripper_at ?o))
        :effect (and (at_target ?o))
    (:action move_o_on_o
        :parameters (?o1 ?o2)
        :precondition (and (gripper_at ?o1) )
        :effect (and (on ?o1 ?o2)   (not (on ?o2 ?o1)))
    (:action move_gripper_to_target
        :parameters ()
        :precondition ()
            (forall ?o: (not (gripper_at ?o))
Listing 1: Block stacking domain

The problem description of the block stacking task for two blocks and the case where all predicates are initially false is described in the following Listing 2.

(define (problem pb1) (:domain block)
    (:goal (and
         (at_target o1)
         (on o0 o1)
Listing 2: Block stacking problem

9.1.2 Tool use

The domain description of the tool use task is described in the following listing 3.

(define (domain tool)
    (:requirements :strips)
    (:objects obj rake)
        (gripper_at ?o)
        (at_target ?o)
        (at ?o1 ?o2)
    (:action move_gripper_to_o
        :parameters (?o)
        :precondition ()
            (and (gripper_at ?o)
                (forall ?o1 != ?o:
                    (not (gripper_at ?o1))
                    (not (at ?o0 ?o1))
                    (not (gripper_at_target))
    (:action move_o_at_o
        :parameters (?o0 ?o1)
        :precondition (and (gripper_at ?o0) )
        :effect (and (at ?o0 ?o1) )
    (:action move_o_to_target_by_o
        :parameters (?o1 ?o0)
        :precondition (and (at ?o0 ?o1) (gripper_at ?o0) )
        :effect (and (at_target ?o1) (at ?o0 ?o1) )
    (:action move_o_to_target
        :parameters (?o)
        :precondition (and (gripper_at ?o) )
        :effect (and (at_target ?o) (gripper_at ?o) )
    (:action move_gripper_to_target
        :parameters ()
        :precondition ()
            (forall ?o:
                (not (gripper_at ?o))
Listing 3: Using tool

9.2 Additional experimental results

Figure 6: Performance comparison (Hierarchical vs. shallow HER) in a simplified tool-using experiment: Robot only has to grasp the rake, then move it to its desired position; there is no blocks involving the task. In this simplified task, observation noise (up to 4%) is only applied to our approach (plotted together), while HER-only approach is trained without observation noise. While our approach allows the robot to learn the task quickly (in less than 20 epochs to achieve 90% success rate), the HER-only approach struggles to collect enough successful experiences to learn the task in more than 450 epochs (to achieve the same performance).