
Augmenting Reinforcement Learning with Behavior Primitives for Diverse Manipulation Tasks

by Soroush Nasiriany, et al.
The University of Texas at Austin

Realistic manipulation tasks require a robot to interact with an environment with a prolonged sequence of motor actions. While deep reinforcement learning methods have recently emerged as a promising paradigm for automating manipulation behaviors, they usually fall short in long-horizon tasks due to the exploration burden. This work introduces MAnipulation Primitive-augmented reinforcement LEarning (MAPLE), a learning framework that augments standard reinforcement learning algorithms with a pre-defined library of behavior primitives. These behavior primitives are robust functional modules specialized in achieving manipulation goals, such as grasping and pushing. To use these heterogeneous primitives, we develop a hierarchical policy that involves the primitives and instantiates their executions with input parameters. We demonstrate that MAPLE outperforms baseline approaches by a significant margin on a suite of simulated manipulation tasks. We also quantify the compositional structure of the learned behaviors and highlight our method's ability to transfer policies to new task variants and to physical hardware. Videos and code are available at





I Introduction

Enabling autonomous robots to solve diverse and complex manipulation tasks has been a grand challenge for decades. In recent years, deep reinforcement learning (DRL) approaches have made great strides towards designing robot manipulation behaviors that are difficult to engineer manually [kalashnikov2018qtopt, openai2019learning, openai2019solving, kalashnikov2021mtopt]. Nonetheless, state-of-the-art DRL models fall short in long-horizon tasks due to the exploration challenge: the robot has to explore a prohibitively large space of possible behaviors for accomplishing a task. To remedy the exploration burden, prior DRL work has developed various temporal abstraction frameworks to exploit the hierarchical structure of manipulation tasks [coreyes18sectar, nachum2018hiro, bacon2016optioncritic, eysenbach2019sorb]. These methods learn low-level controllers, often modeled as skills or options, together with high-level controllers from trial-and-error. While they have demonstrated greater scalability than vanilla DRL methods, they often suffer from high sample complexity, lack of interpretability, and brittle generalization.

In the meantime, decades of research in robotics have produced a rich repertoire of functional modules specialized in particular robot behaviors, such as grasping [bohg2014graspsurvey] and motion planning [karaman2011rrtstar, ijspeert2013dmps]. These pre-built functional modules, which we refer to as behavior primitives, exhibit a high degree of robustness and reusability for achieving certain manipulation goals, such as picking up objects with the end-effector and moving the robot to a target configuration along a collision-free path. Despite their strengths, it remains a challenge for DRL algorithms to use them as building blocks to scaffold complex tasks. The challenge arises primarily because these behavior primitives are heterogeneous by design: they take non-uniform parameters as input, operate at varying temporal resolutions, and exhibit distinct behaviors. An algorithm must therefore reason about the temporal decomposition of a complex task and adaptively compose these behavior primitives accordingly.

A variety of hierarchical modeling approaches in robotics have used behavior modules as low-level building blocks. Notably, task-and-motion planning [kaelbling2011hpn, garrett2020tamp, wang2021comptamp] and neural programming [xu2018neural, huang2019neural] methods have used primitives such as motion planners and pick-and-place controllers to model manipulation tasks in a compositional fashion. They require well-specified domain knowledge to perform task planning or strong human supervision to train a high-level controller with ground-truth task decomposition. These assumptions limit the scalability of these methods in realistic tasks.

Fig. 1: Overview of MAPLE. (a) We present a learning framework that augments the robot’s atomic motor actions with a library of versatile behavior primitives. (b) Our method learns to compose these primitives via reinforcement learning. (c) This enables the agent to solve complex long-horizon manipulation tasks.

In this work, we introduce MAPLE (MAnipulation Primitive-augmented reinforcement LEarning), a general DRL algorithm that harnesses a set of pre-built behavior primitives for solving long-horizon manipulation tasks. To address the exploration challenge of DRL algorithms, our method uses a library of high-level behavior primitives (such as grasping or pushing objects) in conjunction with low-level motor actions to autonomously learn a hierarchical policy (see Fig. 1). Our algorithm models each behavior primitive as an implementation-agnostic controller that produces a temporally extended behavior. At a given state, our DRL policy invokes a behavior primitive (or an atomic motor action) and instantiates it with input parameters. For example, the input parameters to a 6-DoF grasping module can be the pre-grasp end-effector pose. The selected primitive interprets the input parameters and executes one or a sequence of motor actions to realize its specialized behavior. By integrating behavior primitives into DRL algorithms, MAPLE shields away a substantial portion of complexity in manipulation planning, while leaving the flexibility to a generic reinforcement learning algorithm to discover the compositional structure of tasks without strong domain knowledge. Furthermore, by retaining low-level motor actions MAPLE can rely on these actions for the stages of tasks where the finite library of behavior primitives is insufficient to express a desired behavior.

We conduct an extensive set of experiments on a suite of eight manipulation tasks of varying complexities in the robosuite simulation framework [robosuite2020]. We compare our method to standard DRL approaches [haarnoja2018sac] that only use low-level motor actions, and to hierarchical DRL methods that learn options [zhang2019dac, nachum2018hiro, chitnis2020schema] or open-loop task schemas [chitnis2020schema]. MAPLE achieves a higher task success rate than using only atomic actions, and it is the only method that consistently solves all single-arm tasks in the standard robosuite benchmark. We also devise a data-driven metric to quantitatively examine the compositionality of manipulation tasks contingent on the available primitives, offering new insight into the challenges and opportunities of compositional modeling for realistic manipulation tasks.

We highlight three contributions of this work: 1) We develop a novel method that augments standard DRL algorithms with pre-defined behavior primitives to reduce the exploration burden; 2) We validate the effectiveness of our method in solving diverse manipulation tasks and quantitatively analyze the compositional structure of these tasks; and 3) We show that the modularity and abstraction offered by the behavior primitives facilitate knowledge transfer of the learned policies to new task variants and physical hardware.

II Related Work

Deep Reinforcement Learning. Prior work on DRL has investigated a number of approaches to solve long-horizon tasks, through improved exploration strategies [bellemare2016unifying, pathak2017curiositydriven, houthooft2017vime, pathak2019selfsupervised], learning options [nachum2018hiro, bacon2016optioncritic, smith2018inference, zhang2019dac, bagaria2020dsc], unsupervised skill discovery [eysenbach2018diversity, sharma2020dynamicsaware], and integrating planning [eysenbach2019sorb, nasiriany2019planning]. Despite these efforts, today’s DRL methods still struggle in long-horizon robotic tasks due to the exploration burden of learning from scratch. A growing amount of work has examined the use of offline data to alleviate the exploration burden in DRL, namely through demonstration-guided RL [rajeswaran2018learning, nair2018overcoming, gupta2019relay], learned behavioral priors [pertsch2020spirl, singh2020parrot] and action spaces [ajay2020opal, allshire2021laser] from demonstrations, and offline RL [fujimoto2019offpolicy, mandlekar2020iris, kumar2020conservative, fu2021d4rl]. While promising, these methods can be difficult to scale up due to the costs of acquiring offline data.

Hierarchical Modeling in Robotics. Outside of DRL, there has been a plethora of work in robotics dedicated to building customized functional modules that emit specific robot behaviors, such as grasping [bohg2014graspsurvey, mahler2017dexnet2] and motion planning [karaman2011rrtstar, amato1996randomized]. Prior works on task-and-motion planning [kaelbling2011hpn, garrett2020tamp, wang2021comptamp] and neural programming [xu2018neural, huang2019neural] have developed hierarchical models that leverage these modules as building blocks to scaffold manipulation tasks. While these methods have demonstrated impressive capabilities in restrictive domains, their applicability has been limited by their reliance on domain knowledge or human supervision.

To bridge the gap between hierarchical models and DRL algorithms that learn from scratch, recent work has harnessed pre-built primitives, such as model-based planners [lee2020guapo], motion planners [yamada2020mopa, xiali2020relmogen], movement primitives [ijspeert2013dmps, neumann2014skills], and pre-built skills [chitnis2020schema, lee2020skills, strudel2020skills, sharma2020skills, simeonov2020skills], to expedite DRL algorithms. These approaches aim at retaining the flexibilities of RL algorithms to learn general-purpose behaviors while benefiting from the temporal abstraction provided by the primitives. However, these works are limited as they are confined to using only one or two specific primitives [lee2020guapo, yamada2020mopa, xiali2020relmogen], employ rigid primitives that are not reconfigurable [strudel2020skills, sharma2020skills], or hard-code how the primitives are composed [simeonov2020skills]. In contrast, our method adopts a set of versatile primitives and composes them in conjunction with low-level motor actions to solve diverse manipulation tasks.

Reinforcement Learning with PAMDPs. Our formalism specifically falls under the established reinforcement learning framework of Parameterized Action MDPs (PAMDPs) [masson2015reinforcement], in which the agent executes a parameterized primitive at each decision-making step. We note that several prior works [hausknecht2016deep, wei2018hierarchical, xiong2018parametrized, fan2019hybrid, jain2020actions] have adapted off-the-shelf deep RL algorithms to the PAMDP setting. Nonetheless, they have focused on relatively simple game domains, shielding away practical challenges in robot manipulation, such as high-dimensional continuous state/action spaces and heterogeneous primitives. Our work is closest to Chitnis et al. [chitnis2020schema] and Lee et al. [lee2020skills], which have modeled robot manipulation with PAMDPs. We provide empirical comparisons to demonstrate the limitations of their modeling choices, yielding less competitive performance in challenging manipulation tasks than ours.

III Method

Our goal is to enable robots to leverage behavior primitives to solve manipulation tasks effectively and efficiently. To that end, we seek a library of behavior primitives that serve as the building blocks to scaffold manipulation tasks and a reinforcement learning algorithm that composes these primitives to solve tasks. To evaluate whether our algorithm facilitates compositional behaviors, we also propose a metric to quantify the degree to which the resulting learned behavior is compositional. See Fig. 1 for an overview of our method.

III-A Decision-Making with Parameterized Behavior Primitives

We adopt reinforcement learning (RL) as the underlying decision-making framework. The objective of RL is to maximize the expected infinite sum of discounted rewards in a Markov Decision Process (MDP), defined by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, p_0, \gamma)$. The entities in the tuple represent the state space, the action space, the reward function, the transition function, the initial state distribution, and the discount factor. In most robotic RL problems, the action space consists of all atomic actions provided by the robot, such as joint torque commands or end-effector displacements. We would like to augment this action space with a heterogeneous library of behavior primitives that perform semantically meaningful behaviors. Formally, each behavior primitive (which we will call a primitive for brevity) is represented by a control module that executes a finite, variable-length sequence of atomic actions to achieve a certain behavior, where the exact action sequence is specified by input parameters $x \in \mathbb{R}^{d_a}$. Here $d_a$ is the dimension of the input parameters to the primitive, which varies across primitives. To incorporate these behavior primitives into the action space, we recast our decision-making problem as a Parameterized Action MDP (PAMDP) [masson2015reinforcement]. Under this formulation, at each decision-making step the agent executes a parameterized action $(a, x)$ consisting of the type of primitive $a$ and its parameters $x$.

III-B Behavior Primitives: Building Blocks for Manipulation

We are interested in equipping agents with a library of versatile primitives that serve as the core building blocks for diverse manipulation tasks. To devise a general learning framework for composing primitives, our decision-making algorithm assumes no knowledge of the detailed implementations of these primitives. These primitives can come in any generic form, ranging from closed-loop skills learned via reinforcement [haarnoja2018sac, schulman2017ppo] or imitation learning [osa2018il] and analytical motion planners [karaman2011rrtstar] to full-fledged grasping systems [mahler2017dexnet2, bohg2014graspsurvey]. Regardless of their inner workings, we must ensure that our primitives are versatile and adaptive to behavioral variations. In our learning framework, we consider these primitives as functional APIs whose input parameters instantiate action execution. The input parameters usually have clear semantics, such as the 6-DoF end-effector pose for a grasping primitive or a target robot configuration for a motion planning primitive. These parameters significantly improve the flexibility and utility of our primitives for solving complex tasks. Even so, we recognize that our library of primitives may still not be universally applicable in every setting, and equipping the agent solely with these primitives may limit the set of possible behaviors that the agent can achieve. We address this limitation by introducing an additional atomic primitive dedicated to performing atomic robot actions. This atomic primitive allows the agent to fill in any gaps that the other primitives cannot cover.

Here we design a library of five primitives, including prehensile and non-prehensile motions, that forms the basis for many manipulation tasks:

  • Reaching: The robot moves its end-effector to a target location $(x, y, z)$, specified by the input parameters. Execution takes at most 15 atomic actions.

  • Grasping: The robot moves its end-effector to a pre-grasp location $(x, y, z)$ at a yaw angle $\psi$, specified by the input parameters, and closes its gripper. Execution takes at most 20 atomic actions.

  • Pushing: The robot reaches a starting location $(x, y, z)$ at a yaw angle $\psi$ and then moves its end-effector by a displacement $(\delta_x, \delta_y, \delta_z)$. The input parameters are 7D. Execution takes at most 20 atomic actions.

  • Gripper Release: The robot repeatedly applies atomic actions to open its gripper. This primitive has no input parameters. Execution takes 4 atomic actions.

  • Atomic: The robot applies a single atomic action.

We implemented these primitives as hard-coded closed-loop controllers, each requiring only a handful of lines of code. We highlight that these primitives take input parameters of different dimensions, operate at variable temporal lengths, and produce distinct behaviors. These properties make them challenging to utilize in a learning framework. In the following, we will introduce our algorithm for composing these primitives to solve diverse manipulation tasks.
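To make the heterogeneity of the library concrete, the following is a minimal sketch of how the five primitives could be described as data. The names `PrimitiveSpec`, `LIBRARY`, and `validate_action` are hypothetical, the reaching and grasping dimensions assume a 3D position (plus one yaw angle for grasping), and `ATOMIC_DIM` is a placeholder because the atomic action dimension depends on the robot's controller.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class PrimitiveSpec:
    """Static description of one behavior primitive (hypothetical structure)."""
    name: str
    param_dim: int   # number of input parameters the primitive consumes
    max_steps: int   # upper bound on atomic actions per execution


# Placeholder value: the atomic action dimension is controller-dependent
# and is an assumption made only for this sketch.
ATOMIC_DIM = 4

LIBRARY: Dict[str, PrimitiveSpec] = {
    # Assumed 3D target position.
    "reach": PrimitiveSpec("reach", param_dim=3, max_steps=15),
    # Assumed 3D pre-grasp position plus one yaw angle.
    "grasp": PrimitiveSpec("grasp", param_dim=4, max_steps=20),
    # 7D as stated in the text: start position, yaw, displacement.
    "push": PrimitiveSpec("push", param_dim=7, max_steps=20),
    # No input parameters, fixed 4 atomic actions.
    "release": PrimitiveSpec("release", param_dim=0, max_steps=4),
    # A single atomic action.
    "atomic": PrimitiveSpec("atomic", param_dim=ATOMIC_DIM, max_steps=1),
}


def validate_action(primitive: str, params: Sequence[float]) -> PrimitiveSpec:
    """Check that a parameterized action (type, params) is well formed."""
    spec = LIBRARY[primitive]
    if len(params) != spec.param_dim:
        raise ValueError(
            f"{primitive} expects {spec.param_dim} parameters, got {len(params)}"
        )
    return spec
```

A table like this makes the three sources of heterogeneity explicit: each entry differs in parameter dimension, in temporal length, and in the behavior it names.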

Fig. 2: Policy Architecture. We adopt a hierarchical policy, with a high-level task policy that determines which primitive to apply and a low-level parameter policy that determines how to instantiate that primitive.

III-C Composing Primitives via Reinforcement Learning

We follow the PAMDP framework outlined in Section III-A, where at each decision-making step a policy must select a discrete behavior primitive type $a$ and its corresponding continuous parameters $x$. Previous work has explored various policy structures that reason over parameterized primitives. The simplest approach is a flat policy [xiali2020relmogen, hausknecht2016deep] that outputs a distribution over the primitive type $a$ and the parameters of all primitives. A major drawback of this approach is that the total number of policy outputs can quickly become intractable as additional primitives are introduced. We address this limitation with a hierarchical policy where at the high level a task policy $\pi_{\text{tsk}}(a \mid s)$ determines the primitive type and at the low level a parameter policy $\pi_{\text{p}}(x \mid s, a)$ determines the corresponding primitive parameters. For implementation, we represent the task policy as a single neural network and the parameter policy as a collection of sub-networks, with one sub-network dedicated to each primitive. This enables us to accommodate primitives with heterogeneous parameterizations. To allow batch tensor computations across primitives with different parameter dimensions, these parameter policy sub-networks all output a “one size fits all” distribution over parameters $x \in \mathbb{R}^{d_A}$, where $d_A = \max_a d_a$ is the maximum parameter dimension over all primitives. At primitive execution we simply truncate the parameters to the length $d_a$ of the chosen primitive $a$. See Fig. 2 for an illustration of our policy architecture. In addition to reducing the overall number of output parameters, our hierarchical design facilitates modular reasoning, letting the high level focus on which primitive to execute and the low level focus on how to instantiate that primitive. We note that a few prior works have explored this hierarchical design [wei2018hierarchical, fan2019hybrid], but to our knowledge we are the first to demonstrate its utility on complex manipulation domains with a set of heterogeneous primitives.
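The select-then-instantiate procedure and the truncation step can be sketched as follows. This is an illustration only: the two "networks" are toy stand-in functions rather than neural networks, and the names `PARAM_DIMS`, `SUBNET_BIAS`, `task_policy`, `parameter_policy`, and `act`, as well as the dimension values, are assumptions made for the example.

```python
import math
import random
from typing import Dict, List, Tuple

# Hypothetical parameter dimensions per primitive type (illustrative values).
PARAM_DIMS: Dict[str, int] = {
    "reach": 3, "grasp": 4, "push": 7, "release": 0, "atomic": 4,
}
D_MAX = max(PARAM_DIMS.values())  # shared output size of every sub-network

# One toy "sub-network" per primitive, reduced here to a single bias term.
SUBNET_BIAS: Dict[str, float] = {
    name: 0.1 * i for i, name in enumerate(PARAM_DIMS)
}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def task_policy(state: List[float]) -> Dict[str, float]:
    """Stand-in for the task network: a distribution over primitive types."""
    names = list(PARAM_DIMS)
    logits = [0.1 * (i + 1) * sum(state) for i in range(len(names))]
    return dict(zip(names, softmax(logits)))


def parameter_policy(
    state: List[float], primitive: str, rng: random.Random
) -> List[float]:
    """Stand-in for the chosen sub-network: always emits D_MAX values."""
    mean = SUBNET_BIAS[primitive] + 0.1 * sum(state)
    # tanh keeps each sampled parameter inside (-1, 1).
    return [math.tanh(rng.gauss(mean, 1.0)) for _ in range(D_MAX)]


def act(state: List[float], rng: random.Random) -> Tuple[str, List[float]]:
    """Sample a primitive type, then its parameters, then truncate."""
    probs = task_policy(state)
    names = list(probs)
    primitive = rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
    full_params = parameter_policy(state, primitive, rng)
    # Keep only as many parameters as the chosen primitive consumes.
    return primitive, full_params[: PARAM_DIMS[primitive]]
```

Because every sub-network emits a vector of the same size, the outputs can be stacked into one tensor in a real implementation, and the truncation is deferred until a primitive is actually executed.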

In principle, we can integrate our policy architecture with any DRL algorithm designed for continuous control. We choose Soft Actor-Critic (SAC) [haarnoja2018sac], a state-of-the-art DRL algorithm that aims to maximize environment rewards as well as the policy entropy. We replace the standard critic and actor neural networks with our critic network $Q(s, a, x)$ and our hierarchical policy networks $\pi_{\text{tsk}}(a \mid s)$ and $\pi_{\text{p}}(x \mid s, a)$. Under these changes, the critic, the task policy, and the parameter policy each have their own loss.

The coefficients $\alpha_{\text{tsk}}$ and $\alpha_{\text{p}}$ control the maximum entropy objective for the task policy and parameter policy, respectively.
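As a sketch of what these three losses could look like, assume the standard SAC objectives applied to the factored policy $\pi(a, x \mid s) = \pi_{\text{tsk}}(a \mid s)\,\pi_{\text{p}}(x \mid s, a)$. The parameter subscripts $\psi$, $\phi$, $\theta$, the target critic $Q_{\bar\psi}$, and the exact arrangement of terms are assumptions of this sketch, not a transcription of the original equations.

```latex
\begin{aligned}
J_Q(\psi) &= \mathbb{E}\Big[\Big(Q_\psi(s, a, x) - \big(r + \gamma\,\big[\,Q_{\bar\psi}(s', a', x')
  - \alpha_{\text{tsk}} \log \pi_{\text{tsk}}(a' \mid s')
  - \alpha_{\text{p}} \log \pi_{\text{p}}(x' \mid s', a')\,\big]\big)\Big)^2\Big] \\
J_{\text{tsk}}(\phi) &= \mathbb{E}_{a \sim \pi_{\text{tsk}}}\Big[\alpha_{\text{tsk}} \log \pi_{\text{tsk}}(a \mid s)
  - \mathbb{E}_{x \sim \pi_{\text{p}}}\, Q_\psi(s, a, x)\Big] \\
J_{\text{p}}(\theta) &= \mathbb{E}_{a \sim \pi_{\text{tsk}}}\,\mathbb{E}_{x \sim \pi_{\text{p}}}\Big[\alpha_{\text{p}} \log \pi_{\text{p}}(x \mid s, a)
  - Q_\psi(s, a, x)\Big]
\end{aligned}
```

In this form the log-probability of the joint action splits into a task term and a parameter term, which is what allows each level of the policy to carry its own entropy coefficient.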

III-D Facilitating Exploration with Affordances

Compared with existing methods that reason purely over atomic actions, our algorithm benefits from accelerated exploration due to the temporal abstraction provided by our behavior primitives. However, as previous work [pertsch2020spirl] has noted, even reasoning with temporally extended actions can present an exploration challenge. One way to address this issue is to equip the agent with affordances that help to discern the utility of actions in different settings. For example, a grasping skill is only appropriate when applied in the vicinity of graspable objects, and a pushing skill is only appropriate in the vicinity of pushable objects.

In our framework, these affordances can be expressed by adding to the reward function an auxiliary affordance score that measures the affinity for the chosen parameters at a particular state for a given primitive. These affordance scores can in principle come from learned models trained on robot interaction data [simeonov2020skills, nagarajan2020learning, mandikal2020graff, mo2021where2act, xu2021daf] or human data [do2018affordance, fang2018demo2vec, nagarajan2019grounded, kokic2020learning]. Nonetheless, as our primitive parameters carry clear semantic meanings, we can analytically define these affordance scores based on the objects’ physical states. Concretely, for the atomic and gripper release primitives, we always give the maximum affordance score to enable the universal applicability of these primitives. For the remaining reach, grasp, and push primitives we implement general, easy-to-define affordances encouraging the agent to reach relevant areas of interest in the workspace. More specifically, these primitives all involve reaching a location, and we encourage the agent to specify the reaching parameters to be within a threshold of a set of keypoints.

The keypoints for pushing are the locations of objects to push, the keypoints for grasping are the locations of objects to grasp, and the keypoint for reaching is the target reaching location.
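A keypoint-based affordance score of this kind could be computed as in the sketch below. The function name `affordance_score`, the `tanh` decay beyond the threshold, and the choice of 1 as the maximum score are assumptions made for illustration; the text only specifies that reaching parameters within a threshold of some keypoint are encouraged.

```python
import math
from typing import Sequence


def affordance_score(
    reach_xyz: Sequence[float],
    keypoints: Sequence[Sequence[float]],
    threshold: float,
) -> float:
    """Score in [0, 1] for a proposed reaching location.

    Returns 1.0 when the location lies within `threshold` of at least one
    keypoint, and decays smoothly toward 0 as the closest keypoint gets
    farther away (assumed decay shape).
    """
    if not keypoints:
        raise ValueError("at least one keypoint is required")
    best = 0.0
    for point in keypoints:
        distance = math.dist(reach_xyz, point)
        excess = max(distance - threshold, 0.0)  # distance beyond the threshold
        best = max(best, 1.0 - math.tanh(excess))
    return best
```

Taking the maximum over keypoints means the agent is rewarded for approaching any relevant object, not a particular one, which keeps the affordance generic across tasks.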

[Figure panels: Door Opening, Pick and Place, Nut Assembly, Peg Insertion]

Fig. 3: Simulated Environments. We perform evaluations on eight manipulation tasks. The first six come from the robosuite benchmark [robosuite2020]. We designed the last two to test our method in multi-stage, contact-rich tasks: Cleanup requires storing a spam can into a storage bin and a jello box at a corner; Peg Insertion requires inserting a peg into a block.

Fig. 4: Main Results. Learning curves showing average episodic task rewards throughout training, normalized between 0 and 100. All experiments are averaged over 5 seeds, with shaded regions depicting the standard deviation.

III-E Quantifying Compositionality

Our framework relies on the hypothesis that most manipulation tasks have an intrinsic compositional structure and that our algorithm can discover this structure. To examine this hypothesis, we propose a quantifiable metric that measures the degree to which our learned agent exhibits compositional behaviors. Assume that we are given a set of trajectories in which the agent solved a task. The corresponding task sketches, the sequences of primitive types invoked along each trajectory, capture the high-level task semantics and provide useful abstractions through which we can analyze the compositional structure of these trajectories.

Intuitively, agents that demonstrate compositional reasoning will express recurring patterns of behaviors across their task sketches and prefer the use of high-level primitives over low-level ones. We quantify this intuition by computing the Levenshtein distance [levenshtein1966binary] among task sketches, which in our context measures the minimum number of single-token edits (insertions, deletions, or substitutions) needed to transform one task sketch into another. In our task sketches we represent each non-atomic primitive type as a unique token, and in order to explicitly discourage the use of low-level atomic actions, we represent each individual occurrence of an atomic primitive as its own unique token. Given a task and a library of available primitives, we compute the compositionality of the agent’s behavior as the average pairwise normalized score between the resulting task sketches.

Note that this measure is contingent on the choice of behavior primitives in the library, and we can use this measure to compare the effectiveness of different libraries.
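The tokenization rule and the pairwise Levenshtein comparison can be sketched as follows. The helper names (`levenshtein`, `tokenize`, `compositionality`) are hypothetical, and two details are assumptions of this sketch: each distance is normalized by the length of the longer sketch, and the final score is one minus the average normalized distance so that higher values indicate more compositional behavior.

```python
from itertools import combinations
from typing import Hashable, List, Sequence


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Minimum number of single-token insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        curr = [i]
        for j, token_b in enumerate(b, start=1):
            cost = 0 if token_a == token_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def tokenize(sketch: Sequence[str], sketch_id: int) -> List[Hashable]:
    """Non-atomic types share one token; every atomic occurrence is unique."""
    return [
        (sketch_id, position) if primitive == "atomic" else primitive
        for position, primitive in enumerate(sketch)
    ]


def compositionality(sketches: Sequence[Sequence[str]]) -> float:
    """One minus the average pairwise normalized Levenshtein distance."""
    tokenized = [tokenize(s, k) for k, s in enumerate(sketches)]
    pairs = list(combinations(tokenized, 2))
    if not pairs:
        raise ValueError("need at least two task sketches")
    total = 0.0
    for first, second in pairs:
        longest = max(len(first), len(second))
        if longest:
            total += levenshtein(first, second) / longest
    return 1.0 - total / len(pairs)
```

Because every atomic occurrence gets its own token, two sketches made mostly of atomic actions can never match each other, which drives the score toward zero and penalizes reliance on low-level actions.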

One question that arises is whether MAPLE incentivizes the agent to discover compositional task structures in the first place. While there is no explicit mechanism to discover recurring patterns of primitives, our algorithm exhibits compositional reasoning by preferring the use of high-level primitives over low-level ones. Due to the temporal abstraction encapsulated by the high-level primitives, the agent can make far greater progress toward solving the task by using high-level primitives and thus receives higher average reward per timestep. This incentivizes the agent to choose higher-level primitives over lower-level actions whenever appropriate.

IV Experiments

Our experiments study 1) whether our method can compose pre-built behavior primitives and atomic actions to solve complex tasks, 2) the degree to which the learned behavior is compositional, and 3) whether our approach is amenable to transfer to task variants and to real hardware. Videos of the experiments can be found on the project webpage.

IV-A Experimental Setup

We examine these questions on robosuite [robosuite2020], a framework for simulated robot manipulation tasks. We consider a comprehensive suite of eight manipulation tasks of varying complexities (see Fig. 3). For all tasks we adopt a Franka Emika Panda robot arm equipped with a parallel jaw gripper (with the exception of the wiping task). The robot is controlled through end-effector displacements with an operational space controller (OSC) [khatib1995osc]. At each decision-making step our agent can execute either an atomic OSC action or one of the temporally extended non-atomic primitives outlined in Section III-B. In return the agent receives 1) a dense reward signal indicating task progress and 2) an observation comprising the robot’s proprioceptive state and pose information of the objects in the environment.

IV-B Quantitative Evaluations

We compare our method (MAPLE) to five baselines. The first baseline uses exclusively atomic actions (Atomic), which corresponds to the standard Soft Actor-Critic model [haarnoja2018sac] trained on end-effector commands. To understand the effect of hierarchy on our policy design, we compare to a flat variant where the policy outputs the primitive type and parameters independently (Flat), following the design by Lee et al. [lee2020skills] and Neunert et al. [neunert2020continuousdiscrete]. We also compare to a variant of our method using an open-loop task policy (Open Loop), following Chitnis et al. [chitnis2020schema], who suggest that utilizing an open-loop task schema improves the sample efficiency of the learning algorithm. Next, we compare to HIerarchical Reinforcement learning with Off-policy correction (HIRO) [nachum2018hiro] and Double Actor-Critic (DAC) [zhang2019dac], state-of-the-art hierarchical DRL methods which aim to learn low-level policies (or options) along with high-level controllers. HIRO failed to make progress and we thus omit it from our results. Finally, we compare to a self baseline where we include all primitives except the atomic primitive (MAPLE (Non-Atomic)), to understand whether we need atomic actions to satisfy behaviors that cannot be fulfilled by the non-atomic primitives. All baselines using behavior primitives use the affordance score outlined in Section III-D.

Fig. 5: Analyzing Learned Behavior. (Top) We visualize the task sketches that our agent has learned. Each row corresponds to a single sketch progressing temporally from left to right. For each task we also report the compositionality score. (Bottom) We visualize the behavior for a peg insertion sketch.

Fig. 4 outlines environment rewards throughout training. We also evaluated the final task success rates at the end of training: MAPLE achieved the highest average success rate of all methods (90%), ahead of the Atomic, Flat, Open Loop, DAC, and MAPLE (Non-Atomic) baselines. First, we see that the inclusion of non-atomic primitives allows MAPLE to significantly outperform the Atomic baseline, achieving on average 2-3 times higher rewards and a higher success rate. Qualitatively, we found that the Atomic baseline fails to advance past the first stage in most tasks, while our method is able to successfully solve all tasks. Next, we find that the Flat baseline is unable to reliably solve all tasks, demonstrating that our hierarchical policy design is key to reasoning over a heterogeneous set of primitives. While the Open Loop baseline is able to solve basic tasks such as Door Opening and Pick and Place, it struggles with tasks that require the agent to adaptively reason about the current state of the task. DAC is only able to solve the Lift task, highlighting the difficulty of learning complex tasks from scratch even when employing temporal abstraction. Finally, we find that the Non-Atomic self baseline is on par with our method in most tasks, yet it notably fails for Peg Insertion, as the non-atomic primitives are not expressive enough to perform the contact-rich insertion phase. Together, these results highlight that given an appropriate library of primitives and an appropriate policy structure we can solve a wide range of manipulation tasks.

IV-C Model Analysis

Fig. 6: (a) Task Transfer. We transfer the learned task sketch from a source task (pick and place can) to a semantically similar task variant (pick and place bread), enabling us to learn the target task faster. (b) Ablations. Without affordances, reaching, or grasping, the agent is unable to solve tasks due to the exploration burden.

Emergence of Compositional Structures. We present an analysis of the task sketches that our method learned for each task in Fig. 5. We see evidence that the agent unveils compositional task structures by applying temporally extended primitives whenever appropriate and relying on atomic actions otherwise. For example, for the peg insertion task the agent leverages the grasping primitive to pick up the peg and the reaching primitive to align the peg with the hole in the block, but then it uses atomic actions for the contact-rich insertion phase. In Fig. 5 we also quantify the degree to which these task sketches are compositional via the compositionality score that we defined in Eq. 1. As we can see, tasks involving contact interactions such as Peg Insertion and Wiping have lower scores than prehensile tasks such as Pick and Place and Stacking.

Transfer to Semantically Similar Task Variants. We have seen how task sketches enable interpretability by serving as blueprints of high-level semantic task structure. We can leverage these task sketches to accelerate learning on similar task instances. We propose to re-use the task sketch from a semantically similar task and only learn the corresponding primitive parameters. We validate this idea with a preliminary experiment on the Pick and Place domain, where we transfer the task sketch from a source task of placing a soda can into one bin to a target task of placing a loaf of bread into a different bin. As shown in Fig. 6(a), we are able to solve the bread task significantly faster than learning the task from scratch. This result implies that our task sketch serves as a high-level scaffold of a manipulation task, which can be re-used by learning algorithms for faster adaptation to related task variants.

Ablation Study. We perform an ablation study examining the role of affordances and individual manipulation primitives in facilitating exploration. We specifically perform experiments on the Pick and Place task, comparing our method (Ours) to ablations 1) without affordances in the reward function (No Aff), 2) without the reaching skill (No Reach), and 3) without the grasping skill (No Grasp). We see in Fig. 6(b) that without these components the agent fails to solve the task, underscoring that our method relies on the appropriate primitive skills and affordances to effectively overcome the exploration burden.

IV-D Real-World Evaluation

We conclude with an evaluation on real-world copies of the Stack and Cleanup tasks (see Fig. 7). As our behavior primitives offer high-level action abstractions and encapsulate low-level complexities of motor actuation, our policies can directly transfer to the real world. We trained MAPLE on simulated versions of these tasks and executed the resulting policies in the real world. We re-implemented our behavior primitives on the real robot and used an off-the-shelf pose estimation model [tremblay2018dope] to estimate the environment states as the model input. We achieved an average success rate of on Stack and on Cleanup.

Fig. 7: Transfer to Real-World Tasks. We transfer our policy trained on simulated environments to the real-world Stack and Cleanup tasks.

V Conclusion

We presented MAPLE, a reinforcement learning framework that incorporates behavior primitives in conjunction with low-level motor actions to solve complex manipulation tasks. Our experiments demonstrate that behavior primitives can significantly improve exploration while low-level motor actions allow us to retain flexibility to learn intricate behaviors. Our work opens the possibility for several avenues for future work. First, learning affordances using data-driven methods [simeonov2020skills, kokic2020learning, mandikal2020graff, mo2021where2act] can expand the scalability of our method. Second, while atomic actions can help fill in gaps where the primitives are insufficient (such as peg insertion), we are unable to fill in large gaps that require a significant number of low-level action executions (as seen in the ablation experiments). Further research on exploration and credit assignment is needed to overcome these challenges. Finally, an exciting avenue for future work is to continually discover recurring compositions of primitives and add them to the library of primitives, which can ultimately enable curriculum learning of progressively more challenging tasks.


We would like to thank Yifeng Zhu for assisting with the real-world experiments and Abhishek Joshi for providing simulation rendering content. We would also like to thank Yifeng Zhu, Braham Snyder, and Josiah Wong for providing feedback on this manuscript. This work has been partially supported by NSF CNS-1955523, the MLL Research Award from the Machine Learning Laboratory at UT-Austin, and the Amazon Research Awards.

Appendix A Implementation Details

A-A Behavior Primitives

We elaborate on the manipulation primitives that we outlined in Section III-B: reaching, grasping, pushing, gripper release, and atomic. We classify all primitives that are not the atomic primitive as non-atomic primitives. Under the hood, all non-atomic primitives execute a variable sequence of atomic actions, either until the primitive is successfully executed or until a time limit is reached. All atomic actions specifically interface with the Operational Space Control (OSC) controller, which has 5 degrees of freedom: 3 degrees to control the position of the end effector, 1 degree to control the yaw angle, and (for all tasks but wiping) 1 degree to open and close the gripper.

We elaborate further on our non-atomic primitives. The gripper release primitive executes a fixed number of atomic actions to open the gripper. The reaching, grasping, and pushing primitives are hard-coded closed-loop controllers that all entail a reaching phase, either for reaching the starting location (for pushing) or for reaching the final location (for reaching and grasping). To implement this functionality for table-top environments (all except door), the robot first lifts its end effector to a pre-specified height, then hovers to the target XY position, and finally lowers its end effector to the target location. For other environments (door), the robot moves toward the target location directly via the OSC controller. During this reaching phase, the reaching primitive keeps its gripper closed (except for non-tabletop environments like door) and the grasping and pushing primitives keep their grippers open. The grasping and pushing primitives can be configured to achieve a specified yaw angle, which they satisfy during the reaching phase while simultaneously applying end effector displacements. Upon reaching, the grasping primitive emulates grasping by closing its gripper and the pushing primitive emulates pushing by applying end effector displacements in a specified direction.
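As a rough illustration of the table-top reaching phase described above (lift, hover, lower, bounded by a time limit), the following minimal sketch may help. The function names, the step size, and the time limit are hypothetical choices made for this example and are not taken from the MAPLE codebase; a bounded straight-line displacement stands in for an atomic OSC action.

```python
from typing import List, Tuple

Position = Tuple[float, float, float]


def tabletop_reach_waypoints(current: Position, target: Position,
                             hover_height: float) -> List[Position]:
    """Waypoints for the lift / hover / lower reaching phase."""
    cx, cy, _ = current
    tx, ty, tz = target
    return [
        (cx, cy, hover_height),  # lift the end effector to the pre-specified height
        (tx, ty, hover_height),  # hover to the target XY position
        (tx, ty, tz),            # lower to the target location
    ]


def step_toward(current: Position, waypoint: Position, max_step: float) -> Position:
    """Move at most max_step toward the waypoint (stand-in for one atomic action)."""
    delta = [w - c for c, w in zip(current, waypoint)]
    dist = sum(d * d for d in delta) ** 0.5
    if dist <= max_step:
        return waypoint  # close enough: snap onto the waypoint
    scale = max_step / dist
    return (current[0] + delta[0] * scale,
            current[1] + delta[1] * scale,
            current[2] + delta[2] * scale)


def reach(current: Position, target: Position, hover_height: float,
          max_step: float = 0.05, time_limit: int = 200) -> Tuple[Position, int]:
    """Run the reaching phase until the target is reached or the time limit hits."""
    steps = 0
    for waypoint in tabletop_reach_waypoints(current, target, hover_height):
        while current != waypoint and steps < time_limit:
            current = step_toward(current, waypoint, max_step)
            steps += 1
    return current, steps
```

In this sketch the primitive consumes a variable number of atomic steps depending on the distance travelled, which mirrors the statement that non-atomic primitives run until they succeed or a time limit is reached.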

A-B Algorithm

Our algorithm implementation is based on Soft Actor-Critic. Our algorithm alternates between collecting on-policy transitions in the environment and performing off-policy training on data sampled from the replay buffer. Training specifically entails optimizing the Q network, task policy, and parameter policy via gradient descent. As in the original SAC implementation, we use the reparameterization trick with respect to the parameter policy loss in order to reduce the variance of our gradient estimates. While we assume continuous primitive parameters, we can also represent discrete parameters and apply reparameterization with the Gumbel-Softmax trick [jang2017categorical, maddison2017concrete]. We provide a full outline of our algorithm in Algorithm 1.

Initialize Q network, task policy, parameter policy, and replay buffer
for each iteration do
   for each episode do {Exploration Phase}
      Initialize timer
      Initialize episode
      while episode not terminated do
         Sample primitive type from task policy
         Sample primitive parameters from parameter policy
         Truncate sampled parameters to dimension of sampled primitive
         Execute primitive with truncated parameters in environment, obtain reward and next state
         Add affordance score to reward
         Add transition to replay buffer
         Update timer
      end while
   end for
   for each training step do {Training Phase}
      Update Q network
      Update task policy
      Update parameter policy
   end for
end for
Algorithm 1 MAnipulation Primitive-augmented reinforcement LEarning (MAPLE)
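The action-selection steps of the exploration phase (sample a primitive type, sample parameters conditioned on it, then truncate) can be sketched as follows. This is a minimal sketch under stated assumptions: the two policies are random stubs rather than learned networks, and the per-primitive parameter counts in `PARAM_DIMS` are hypothetical, except that the atomic primitive uses 5 values to match the 5-degree-of-freedom OSC action described earlier.

```python
import math
import random
from typing import Dict, List, Tuple

# Hypothetical parameter counts per primitive; only the atomic value of 5
# mirrors the OSC action dimension stated in the text.
PARAM_DIMS: Dict[str, int] = {
    "reach": 3, "grasp": 4, "push": 7, "release": 0, "atomic": 5,
}
MAX_DIM = max(PARAM_DIMS.values())


def task_policy(state: List[float]) -> Dict[str, float]:
    """Stub task policy: a uniform distribution over primitive types."""
    return {name: 1.0 / len(PARAM_DIMS) for name in PARAM_DIMS}


def parameter_policy(state: List[float], primitive: str,
                     rng: random.Random) -> List[float]:
    """Stub parameter policy: tanh-squashed Gaussian samples of length MAX_DIM.

    A learned policy would condition on both the state and the primitive type.
    """
    return [math.tanh(rng.gauss(0.0, 1.0)) for _ in range(MAX_DIM)]


def sample_action(state: List[float],
                  rng: random.Random) -> Tuple[str, List[float]]:
    """Two-stage sampling followed by truncation to the primitive's dimension."""
    probs = task_policy(state)
    names = list(probs)
    primitive = rng.choices(names, weights=[probs[n] for n in names], k=1)[0]
    full_params = parameter_policy(state, primitive, rng)
    params = full_params[: PARAM_DIMS[primitive]]  # drop unused trailing outputs
    return primitive, params
```

Truncation lets a single fixed-width parameter head serve primitives with different numbers of parameters.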

A-C Affordance Score

We elaborate on the affordance score introduced in Section III-D:

The keypoint is dependent on the primitive and the current state . For example for the cleanup task, the keypoint for the pushing primitive is the location of a pushable object (the jello box), the keypoint for a grasping primitive is the location of a graspable object (the spam can), and the keypoint for the reaching primitive is the location of the bin. If there are multiple keypoints of interest we calculate the affordance score corresponding to each keypoint and consider the maximum score. If no applicable keypoint exists for a primitive (e.g. there are no pushable objects in door opening) we give an affordance score of . By default we set the threshold to for grasping, for reaching, and for pushing. There are a few exceptions for tasks that need larger affordance regions for reaching large objects.
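To make the keypoint logic above concrete, here is a minimal sketch of an affordance score that takes the maximum over several keypoints and falls back to a fixed value when a primitive has no applicable keypoint. The functional form used here (full score within the threshold distance, decaying smoothly beyond it) is an assumption made purely for illustration, as are the function name and arguments; the threshold and the fallback score are left as caller-supplied inputs.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def affordance_score(reach_pos: Point, keypoints: Sequence[Point],
                     threshold: float, no_keypoint_score: float) -> float:
    """Illustrative affordance score for one primitive at one reaching position."""
    if not keypoints:
        # No applicable keypoint for this primitive in this task.
        return no_keypoint_score

    def score_for(keypoint: Point) -> float:
        # Assumed form: no penalty inside the threshold, smooth decay outside it.
        excess = max(math.dist(reach_pos, keypoint) - threshold, 0.0)
        return 1.0 - math.tanh(excess)

    # With multiple keypoints of interest, keep the best score.
    return max(score_for(k) for k in keypoints)
```

Under this assumed form, a larger threshold widens the region that earns the full score, which corresponds to the larger affordance regions mentioned for reaching large objects.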

A-D Flat Baseline

We considered two variants for our flat baseline. One variant, which has been explored by prior work [xiali2020relmogen, hausknecht2016deep], outputs a distribution over the primitive type and the parameters for all primitives. As we discussed in Section III-C, under this approach the number of policy outputs scales linearly with the total number of primitive parameters, which can lead to optimization difficulties for large behavior libraries. Empirically we found this to be the case, as we were unable to make any progress on any of our tasks despite extensive hyperparameter tuning. Neunert et al. [neunert2020continuousdiscrete] proposed an alternative approach of replacing the distribution over all parameter outputs with the “one size fits all” distribution that we described in Section III-C. Parameter selection occurs by independently sampling a primitive type and parameters, and subsequently truncating the sampled parameters to the dimension of the sampled primitive type. This sampling strategy was also adopted by Lee et al. [lee2020skills]. We note that this independent sampling process is in contrast to our two-stage hierarchical process. We adopted this variant as our flat baseline, and in Figure 4 we see that it often leads to sub-optimal performance. We hypothesize that this is due to the fact that the parameter selection process is not informed by the primitive type selection process, which reduces the agent’s utility especially when dealing with primitives that have heterogeneous parameter structures.
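The difference between the two flat variants can be illustrated with a small sketch. The per-primitive parameter counts below are hypothetical, and uniform random draws stand in for the policy distributions; the point is only to show how the output count grows in the first variant and how the second variant samples parameters without knowing the primitive type.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical parameter counts per primitive, for illustration only.
PARAM_DIMS: Dict[str, int] = {
    "reach": 3, "grasp": 4, "push": 7, "release": 0, "atomic": 5,
}


def outputs_all_parameters(param_dims: Dict[str, int]) -> int:
    """First variant: one output per parameter of every primitive."""
    return sum(param_dims.values())


def outputs_one_size_fits_all(param_dims: Dict[str, int]) -> int:
    """Second variant: one shared vector as wide as the largest primitive."""
    return max(param_dims.values())


def flat_sample(rng: random.Random) -> Tuple[str, List[float]]:
    """Independent sampling: parameters are drawn without seeing the primitive."""
    width = outputs_one_size_fits_all(PARAM_DIMS)
    shared = [rng.uniform(-1.0, 1.0) for _ in range(width)]
    primitive = rng.choice(sorted(PARAM_DIMS))
    return primitive, shared[: PARAM_DIMS[primitive]]  # truncate afterwards
```

With these illustrative counts the first variant needs 19 parameter outputs and the shared variant needs 7; adding a primitive always grows the former but grows the latter only if the new primitive is the widest.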

A-E Open Loop Baseline

Our open loop baseline follows an open loop task schema, and is inspired by Chitnis et al. [chitnis2020schema]. The open loop baseline and our method share identical implementations, except for the input to the task policy: our method takes in the current environment observation while the open loop baseline takes in only the current episode timestep. We highlight that while our implementation is inspired by Chitnis et al. [chitnis2020schema], there are notable differences. Their update rule for the “task policy” (or equivalent thereof) does not use gradient descent, relies on on-policy sampling, and is designed for the sparse reward setting only. We found these assumptions to be restrictive for our algorithmic and task setup, and we instead use gradient-based, off-policy reinforcement learning methods which can work with sparse or dense rewards. Despite these differences, we believe that our open loop baseline captures the essence of the ideas proposed in Chitnis et al. [chitnis2020schema], namely that open loop task schemas can enable more efficient and effective learning. As we show in Figure 4, however, we did not find this to be the case for the relatively more complex tasks in our suite of manipulation domains.

A-F DAC Baseline

We considered a number of potential methods as our representative option-learning baseline. While prominent prior work [bacon2016optioncritic, klissarov2017learning, zhang2019dac] has focused on learning options, we verified that Double Actor-Critic (DAC) achieves superior performance on the OpenAI HalfCheetah-v2 task and our lift task, so we chose DAC as our representative options baseline. We also considered Deep Skill Chaining [bagaria2020dsc], another recent work that learns options; however, it was not applicable to our manipulation domains given that it is designed primarily for goal-based navigation agents. We used the implementation publicly released by the DAC authors, and we adopted the hyperparameters that they suggested in their paper.

Appendix B Experimental Setup

B-A Environments

We conduct experiments on eight manipulation tasks of varying complexities, spanning diverse prehensile and non-prehensile behaviors. The first six come from the standard robosuite benchmark [robosuite2020]. We designed the last two (cleanup, peg insertion) to evaluate our method in multi-stage, contact-rich tasks. We elaborate on each as follows:

Lift: the robot must pick up a cube and lift it above the table.
Door Opening: the robot must turn the door handle and open the door.
Pick and Place: the robot must pick up a soda can and place it into a specific target compartment.
Wipe: the robot must wipe a table containing spilled debris. A penalty is given if the robot presses too hard into the table.
Stack: the robot must stack a cube on top of another cube.
Nut Assembly: the robot must fit a nut tool onto the round peg.
Cleanup: the robot must store a spam can into a storage bin and store a jello box at the upper right corner.
Peg Insertion: the robot must pick up the peg and insert it into the opening of a wooden block.

B-B Training

We provide a full list of our algorithm hyperparameters in Table I. We note a few additional details. For a consistent comparison across baselines, our episode lengths are fixed to 150 atomic timesteps (300 for wipe; see Table I), meaning that we execute a variable number of primitives until we have exceeded the maximum number of atomic actions for the episode. Also, for the first 600k environment steps we set the target entropy for the task policy and parameter policy to a high value to encourage more exploration during the initial stages of training.

Hyperparameter Value
Hidden sizes (all networks)
Q network and policy activation ReLU
Q network output activation None
Policy network output activation tanh
Optimizer Adam
Batch Size
Learning rate (all networks)
Target network update rate
# Training steps per epoch
# (Low-level) exploration actions per epoch
Replay buffer size
Episode length (# low-level actions) 150 (except wipe, 300)
Discount factor
Reward scale
Affordance score scale
Automatic entropy tuning True
Target Task Policy Entropy , is number of primitives
Target Parameter Policy Entropy
TABLE I: Hyperparameters for our algorithm
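The fixed atomic-step budget described in the training details can be sketched as a simple loop. This is an illustrative sketch: `select_primitive` and `execute_primitive` are hypothetical callables, the default budget of 150 is the episode length listed in Table I, and letting the final primitive run to completion (rather than cutting it off mid-execution) is an assumption of this example.

```python
from typing import Callable, Tuple


def run_episode(select_primitive: Callable[[], str],
                execute_primitive: Callable[[str], int],
                max_atomic_steps: int = 150) -> Tuple[int, int]:
    """Execute primitives until the atomic-action budget is used up.

    execute_primitive returns how many atomic actions the primitive consumed.
    Returns (number of primitives executed, total atomic actions used).
    """
    atomic_steps = 0
    num_primitives = 0
    while atomic_steps < max_atomic_steps:
        primitive = select_primitive()
        # Count at least one step per call so the loop always makes progress.
        atomic_steps += max(1, execute_primitive(primitive))
        num_primitives += 1
    return num_primitives, atomic_steps
```

Because non-atomic primitives consume many atomic actions per call, episodes that rely on them contain far fewer decisions than episodes made up of atomic actions alone, even though the atomic budget is identical.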

B-C Evaluation

We elaborate on the evaluation protocol for our experiments in Figure 4. We evaluate each experimental variant (combination of task and method) over 5 seeds and we (1) plot the agent’s rewards throughout training and (2) report the task success rate at the end of training. Specifically for the reward plots, we evaluate the agent’s average episodic task rewards (excluding the affordance reward) at regularly spaced training checkpoints every environment exploration steps. The episodic rewards are averaged over 20 episodes and are normalized between and , where corresponds to the agent receiving the maximum possible reward at every single timestep of the episode. We post-process the plots, showing the moving average of results over the last environment steps. For reporting the final task success rate, we load the final training checkpoint and report the average task success rate over 20 episodes. Success rates for our tasks are defined as follows:

  • Lift: whether the block is above a height threshold

  • Door: whether the door angle is past a threshold

  • Pick and Place: whether the can is in the correct target bin and the robot is not holding the can

  • Wipe: whether all of the debris is wiped off the table

  • Stack: whether the smaller cube is on top of the larger cube and the robot is not holding either cube

  • Nut Assembly: whether the nut is fitted completely onto the round peg and the robot is not holding the nut

  • Cleanup: whether the spam can is in the bin and the jello box is within a threshold distance away from the table corner

  • Peg Insertion: whether the peg is inserted into the wooden block past a threshold distance
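The reward-curve post-processing described in the evaluation protocol above (normalizing episodic rewards by the maximum attainable return, then smoothing with a moving average) could be sketched as follows. The function names and the window size are hypothetical, and the sketch assumes per-step rewards are non-negative so that the normalized value is bounded below by zero.

```python
from typing import List, Sequence


def normalized_return(rewards: Sequence[float], max_reward_per_step: float,
                      episode_length: int) -> float:
    """Episodic return divided by the return of a maximal reward at every step."""
    return sum(rewards) / (max_reward_per_step * episode_length)


def trailing_moving_average(values: Sequence[float], window: int) -> List[float]:
    """Average each checkpoint with up to window - 1 preceding checkpoints."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

A trailing window only uses checkpoints that have already been evaluated, so the smoothed curve can be plotted at any point during training.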

Final task success rates for all baselines across all tasks are outlined in Table II.

| Method | Lift | Door | Pick and Place | Wipe | Stack | Nut Assembly | Cleanup | Peg Insertion |
|---|---|---|---|---|---|---|---|---|
| Atomic [haarnoja2018sac] | 98.0 ± 2.4 | 0.0 ± 0.0 | 0.0 ± 0.0 | 18.0 ± 18.3 | 38.0 ± 28.7 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| Flat [lee2020skills, neunert2020continuousdiscrete] | 61.0 ± 47.8 | 100.0 ± 0.0 | 1.0 ± 2.0 | 22.0 ± 9.8 | 98.0 ± 2.4 | 0.0 ± 0.0 | 0.0 ± 0.0 | 8.0 ± 13.6 |
| Open Loop [chitnis2020schema] | 43.0 ± 43.5 | 100.0 ± 0.0 | 81.0 ± 38.0 | 16.0 ± 4.9 | 85.0 ± 3.2 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 |
| DAC [zhang2019dac] | 75.0 ± 12.7 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0 | 16.0 ± 32.0 |
| MAPLE (Non-At.) | 100.0 ± 0.0 | 100.0 ± 0.0 | 100.0 ± 0.0 | 42.0 ± 9.8 | 99.0 ± 2.0 | 93.0 ± 14.0 | 100.0 ± 0.0 | 0.0 ± 0.0 |
| MAPLE (ours) | 100.0 ± 0.0 | 100.0 ± 0.0 | 95.0 ± 7.7 | 42.0 ± 11.7 | 98.0 ± 2.4 | 99.0 ± 2.0 | 91.0 ± 5.8 | 100.0 ± 0.0 |

TABLE II: Final Task Success Rates (%)

We also elaborate on the evaluation protocol for the compositionality scores that we report in Figure 5. Our compositionality scores are averaged over 5 seeds for each task. For each seed we sample 50 task sketches from the last training checkpoint and we discard sketches that did not correspond to the agent solving the task. Of these remaining task sketches we calculate the compositionality score according to Equation 1.

B-D Task Sketch Transfer Experiments

For our task sketch experiments we first extract a set of task sketches from the source task. We subsequently select the sketch that has the lowest Levenshtein distance to all other task sketches. In the case of our pick and place task this was {Grasp, Reach, Release}. Once we have extracted the sketch from the source task, we train on the target task with this sketch held fixed. For each episode we iterate through the sequence of primitives in the sketch, repeating each primitive up to 5 times until the agent receives a high affordance reward, before moving on to the next primitive in the sketch. Upon executing all of the primitives in the task sketch, the agent executes 10 atomic primitives to satisfy any behaviors that it was not able to fulfill with the sketch alone, and then the episode terminates.
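The sketch-selection step can be illustrated with a short example. It treats each task sketch as a sequence of primitive names and interprets "lowest Levenshtein distance" as the lowest summed edit distance to the other sketches; that interpretation, along with the function names, is an assumption of this example.

```python
from typing import List, Sequence, Tuple

Sketch = Tuple[str, ...]


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def select_representative_sketch(sketches: List[Sketch]) -> Sketch:
    """Pick the sketch with the smallest total distance to the whole set."""
    return min(sketches,
               key=lambda s: sum(levenshtein(s, t) for t in sketches))
```

For example, given the sketches (Grasp, Reach, Release), (Grasp, Reach, Release), (Grasp, Reach, Reach, Release), and (Reach, Grasp, Reach, Release), the function returns (Grasp, Reach, Release), since it is at most one edit away from each of the others.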

B-E Real-World Experiments

We performed evaluations on two real-world manipulation tasks:

Stack: the robot must pick up the butter box and stack it on top of the popcorn box.
Cleanup: the robot must (1) pick up the butter box and place it into the bin and (2) push the popcorn to the right side of the table (the white area).

Both tasks resemble the simulated stack and cleanup tasks outlined in Section B-A, but differ in the size of the objects, table size, and workspace layout. To account for these differences we designed variations of our existing simulated stack and cleanup tasks to match the characteristics of our real-world tasks. We trained policies in simulation (until convergence) and transferred them to the real world for evaluation.

For perception we use the robot proprioception data from the robot’s on-board sensors, in conjunction with pose estimates of the objects using the deep object pose estimation system [tremblay2018dope] paired with a single Microsoft Kinect camera. Under this setup the objects are sometimes out of the camera view or are occluded by the robot arm; in these cases the pose estimator does not return estimates of the object. We mostly alleviate such conditions through a three-step procedure: (1) the robot lifts its end effector to a pre-determined location in the air where occlusions are minimized, (2) the pose estimator re-computes the poses of the objects, (3) the robot moves back to its initial location. At the end of step (3), the poses of the objects are assumed to be the pose estimates from step (2), with the exception of objects that were moved by the robot during step (1). These are the objects that the robot was already holding before step (1) and subsequently lifted into the air during step (1). For such objects, we compute the pose of the object as the final robot pose in addition to the relative pose difference of the robot end effector and object during step (2). In addition to handling occlusions, we noticed that the pose estimation system routinely made small errors when estimating the position and orientation of the objects. To minimize the influence of such errors, we hard-coded the pitch and yaw angles of the objects (as they were always flat either on the table or in the air) and the height of the object whenever it was detected to lie on the table. These constraints were necessary to ensure reliable perception estimates, but we anticipate that with improved perception systems in the future such constraints can be relaxed.
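The pose bookkeeping for held objects described above can be sketched as a composition of rigid transforms. This is a simplified sketch under stated assumptions: poses are reduced to a position plus a yaw angle, and the function names are hypothetical. The idea is to record the object's pose relative to the end effector when the estimate is taken in step (2), then re-apply that offset to the end effector's final pose after step (3).

```python
import math
from typing import Tuple

Pose = Tuple[float, float, float, float]  # (x, y, z, yaw); a simplifying assumption


def compose(a: Pose, b: Pose) -> Pose:
    """Express pose b, given in the frame of pose a, in a's parent frame."""
    ax, ay, az, ayaw = a
    bx, by, bz, byaw = b
    c, s = math.cos(ayaw), math.sin(ayaw)
    return (ax + c * bx - s * by, ay + s * bx + c * by, az + bz, ayaw + byaw)


def inverse(p: Pose) -> Pose:
    """Inverse transform, so that compose(inverse(p), p) is the identity."""
    x, y, z, yaw = p
    c, s = math.cos(yaw), math.sin(yaw)
    return (-(c * x + s * y), s * x - c * y, -z, -yaw)


def held_object_pose(ee_at_estimate: Pose, obj_at_estimate: Pose,
                     ee_final: Pose) -> Pose:
    """Object pose after the robot returns, for an object held throughout."""
    # Object pose relative to the end effector when the estimate was taken.
    relative = compose(inverse(ee_at_estimate), obj_at_estimate)
    # Re-apply that fixed offset at the end effector's final pose.
    return compose(ee_final, relative)
```

The computation relies on the grasp staying rigid between the estimate and the return motion; if the object slips in the gripper, the recovered pose will be off by the amount of slip.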

For evaluation, we performed 30 trials for each task, where the robot was allowed a maximum of 20 primitive calls per episode. We recorded an average success rate of for stack (in contrast to in simulation), and for cleanup (in contrast to in simulation). Most failures were either due to the robot repeatedly applying poor grasping actions or the robot hitting its joint limits, subsequently triggering a safety call to halt the robot.