Learning robotic manipulation tasks using reinforcement learning with sparse rewards is currently impractical due to the outrageous data requirements. Many practical tasks require manipulation of multiple objects, and the complexity of such tasks increases with the number of objects. Learning from a curriculum of increasingly complex tasks appears to be a natural solution, but unfortunately, does not work for many scenarios. We hypothesize that the inability of the state-of-the-art algorithms to effectively utilize a task curriculum stems from the absence of inductive biases for transferring knowledge from simpler to complex tasks. We show that graph-based relational architectures overcome this limitation and enable learning of complex tasks when provided with a simple curriculum of tasks with increasing numbers of objects. We demonstrate the utility of our framework on a simulated block stacking task. Starting from scratch, our agent learns to stack six blocks into a tower. Despite using step-wise sparse rewards, our method is orders of magnitude more data-efficient and outperforms the existing state-of-the-art method that utilizes human demonstrations. Furthermore, the learned policy exhibits zero-shot generalization, successfully stacking blocks into taller towers and previously unseen configurations such as pyramids, without any further training.
The main idea in reinforcement learning is to incentivize actions that maximize rewards. Unlike video games, where rewards are readily available, manipulation tasks require a manually constructed reward function. For example, to pick and place a block, the rewards might be inversely proportional to the manipulator’s distance from the block and the block’s distance from the target location. Rewards that frequently provide information about the task in this way are known as dense rewards. With dense rewards, however, an agent can get stuck in a local minimum and never complete the desired task. It is well known that intuitively reasonable reward functions can often result in unexpected or undesirable behaviors. This makes reward design very challenging.
An alternative is to provide the agent with sparse rewards, which may either be provided after the agent completes the overall task (i.e., terminal reward) or extremely intermittently when the agent completes critical steps (i.e., step-wise rewards). Sparse rewards are more straightforward to define than dense rewards. However, because many tasks require execution of a long sequence of actions, sparse rewards drastically complicate the challenges of exploration and credit-assignment. Training with sparse rewards, therefore, either completely fails or requires massive amounts of data.
Practical reinforcement learning systems have sidestepped the challenge of learning from sparse rewards by using (a) human demonstrations, (b) sim2real transfer, (c) careful environmental instrumentation to simplify the task, or (d) meticulous reward shaping. Using these ideas, RL has been applied to a wide variety of robotic tasks such as stacking blocks in a tower [12, 48, 37], opening doors, flipping pancakes, hitting a ball, orienting a cube, and other dexterous manipulation tasks.
Developing data-efficient algorithms that can learn from sparse rewards will alleviate the need for demonstrations and painful reward design. It will consequently open up many application areas where RL cannot be applied today. Past works have improved the learning efficiency of RL algorithms using better optimization methods [26, 23, 51], combinations of model-based and model-free learning, hierarchical learning, and better exploration methods [50, 36, 39, 41, 52]. A few recent works used the compositional task structure to improve the data efficiency of RL algorithms [63, 61, 42].
In the related field of supervised deep learning, transfer of knowledge by pre-training on a source task followed by finetuning on a target task [14, 1, 46] has been very successful in reducing the data requirements. However, in the context of RL, learning from multiple tasks and transferring this knowledge to reduce data requirements for a new task remains an open challenge [65, 9, 29, 38]. One potential reason for the lack of transfer is that learning from a new task exacerbates the already existing problem of credit assignment. The inability to assign credit, in turn, increases the variance in the gradients and consequently results in learning failure. One solution is to pace the agent’s learning, so that it only gets a new task when it has mastered previous tasks (i.e., curriculum learning).
It turns out that in RL settings, curriculum learning is also not straightforward. To better understand why it is the case, consider solving the problem of stacking multiple blocks into a tower using a curriculum of stacking an increasing number of blocks. Suppose the agent has mastered the skill of stacking two blocks. The introduction of the third block preserves the task structure but changes the distribution of the agent’s input. In the absence of appropriate inductive biases to deal with changes in inputs, the agent resorts to treating the new data distribution as a new learning problem and is unable to leverage its knowledge from past tasks efficiently.
One well-known method to tackle a changing data distribution is training with data augmentation. In RL settings, this idea has translated into domain randomization. In the running example, training with randomization involves sampling a random number of blocks from a uniform distribution in every episode. However, because the agent would be tasked to stack multiple blocks in most episodes, learning in such a situation remains very challenging. This consideration suggests that the major hindrance in learning from a curriculum may not be the design of the curriculum, but the inability of learning systems to transfer knowledge across the different tasks in the curriculum.
In this work, we show that training a policy represented by an attention-based graph neural network (GNN) overcomes the challenges associated with curriculum learning in multi-object manipulation tasks. Our agent learns to stack six or more blocks from scratch (see Figure LABEL:fig:fig1). We use a simple curriculum strategy, which increases the number of blocks once the agent masters a target task with fewer blocks. The attention-based GNN complements the curriculum by providing the appropriate inductive bias to transfer knowledge between tasks with different numbers of objects. To the best of our knowledge, ours is the first work to solve the problem of stacking six or more blocks using RL without requiring any expert demonstrations. Our method is orders of magnitude more efficient than the previous state-of-the-art method, which relies on human-provided demonstrations.
Furthermore, our system can build towers that are taller than those seen during training. It also succeeds at placing blocks in different configurations such as pyramids without any additional training (i.e., zero-shot generalization). While we present results on the task of stacking blocks in various arrangements, the approach developed in this work does not make any task-specific assumptions and is therefore applicable to a wide range of tasks involving manipulation of multiple objects.
Our work is broadly related to techniques for scaling reinforcement learning algorithms to more complex robotic manipulation settings, as well as the use of relational and curricular inductive biases in machine learning.
Relational Inductive Bias: The use of relational inductive biases has a long history in reinforcement learning [30, 17, 58], and more broadly in logic and machine learning. Recently, there has been great interest in the use of Graph Neural Networks (GNNs) for representing graph data structures, which are especially suitable for object-oriented environments [8, 11, 31, 27, 59, 5]. In the context of RL, a key motivation for relational representation is to support a varying number of objects as inputs and to explicitly model relationships between objects. In the past, GNNs have been studied in the context of learning and transferring policies for locomotion across agents with variable morphologies [61, 42].
Closest to our work is past research combining GNNs with policy learning for manipulation tasks. However, these works either rely on tens or hundreds of thousands of expert demonstrations [16, 28] or exclusively show results on video games. Furthermore, while these works have considered GNNs to improve the efficiency of solving a single task, we combine GNNs with learning from a curriculum of increasingly complex tasks to solve long-horizon manipulation problems that cannot be solved directly using current methods.
Curriculum Learning: Curriculum learning addresses the effect of data sampling strategies on learning, under the presumption that proper sampling of tasks can allow for more sample-efficient learning and avoidance of local minima. In particular, prior work has shown that ordering tasks by heuristic measures of difficulty can be effective [6, 64]. A line of work has studied automatic discovery of curricula based on learning progress, adversarial self-play [52, 25], or backtracking. So far, these methods have not yielded curricula capable of automatically discovering tasks of the complexity we consider. In this paper, our contribution is not to propose a new algorithm or heuristic for choosing the task curricula, but to demonstrate that graph-based representations can make use of a curriculum for learning complex tasks.
Block Stacking: Prior work on block stacking either relied heavily on human demonstrations [37, 15], required significant reward engineering [48, 44], and/or used a carefully designed curriculum of reaching, picking and placing blocks. Designing such curricula and reward functions is a hard problem with no known principled solution. One prior work stacked blocks using a low-cost robot. However, it assumed the blocks were already picked and used a dense reward function. Other lines of work [33, 57] achieved impressive results on stacking objects, but relied on extensive human-defined knowledge for detecting keypoints or assumed access to a physics simulation. In contrast, we present a simple but effective method for stacking blocks using RL that makes minimal assumptions about task structure or the environment.
Hierarchical Reinforcement Learning (HRL) aims to address the scaling and generalization problem in RL by decomposing problems into smaller subproblems. Examples of HRL frameworks include the “options” framework, feudal learning [60, 10] and the MaxQ framework. A key unsolved challenge is joint end-to-end learning of multiple levels of control, while avoiding degenerate solutions that lack hierarchical abstraction. Most successful instantiations of hierarchical RL make use of domain knowledge to construct a hierarchy. To our knowledge, no HRL algorithms have been successful at stacking tasks of the complexity we consider.
Figure LABEL:fig:fig1 shows our simulated robotic environment, consisting of a 7-DoF Fetch robot arm equipped with a two-fingered parallel-jaw gripper, based on an environment from OpenAI. The MuJoCo physics engine was used for simulation. The robot is tasked with manipulating 1 to 9 blocks placed on a table. Each block is a cube with 5 cm sides. The robot’s action space is 4D, consisting of the relative change in the 3D position of its end-effector and a scalar value representing the distance between the two fingers of the gripper.
Observations: The agent observes gripper features, including gripper velocity and position, and features representing the N blocks, with one feature vector per block. Each block is represented by a 15-D vector consisting of its 3D position, 3D orientation expressed as Euler angles, 3D position relative to the gripper, 3D Cartesian velocity and 3D angular velocity. The goal is expressed as a set of 3D block positions, one per block. The overall input to the agent is therefore the gripper features, the N block feature vectors and the goal positions. At the start of every episode, the initial block positions are randomly initialized on the table and the goal positions are sampled using a pre-determined distribution. The maximum length of every episode depends on N, the number of blocks.
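As an illustration of the observation layout described above, the following is a minimal sketch assuming NumPy. The function names, the dictionary keys (`pos`, `euler`, `lin_vel`, `ang_vel`) and the contents of the gripper feature vector are hypothetical and chosen only for this example; they are not taken from the authors' code.

```python
import numpy as np

def block_features(pos, euler, lin_vel, ang_vel, gripper_pos):
    """Build the 15-D per-block vector: position, Euler angles,
    position relative to the gripper, linear and angular velocity."""
    pos = np.asarray(pos, dtype=float)
    rel_pos = pos - np.asarray(gripper_pos, dtype=float)
    return np.concatenate([pos, euler, rel_pos, lin_vel, ang_vel]).astype(float)

def build_observation(gripper_feats, gripper_pos, blocks, goals):
    """Collect gripper features, an (N, 15) block matrix and (N, 3) goals.

    `blocks` is a list of dicts with keys pos, euler, lin_vel, ang_vel
    (hypothetical names); `goals` holds one 3D target per block.
    """
    block_mat = np.stack(
        [block_features(gripper_pos=gripper_pos, **b) for b in blocks]
    )
    goal_mat = np.asarray(goals, dtype=float).reshape(len(blocks), 3)
    return {
        "gripper": np.asarray(gripper_feats, dtype=float),
        "blocks": block_mat,
        "goals": goal_mat,
    }
```

Keeping the blocks as rows of a matrix, rather than one flat vector, is what lets the number of blocks N vary between episodes without changing the per-block feature size.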
Reward: We use a step-wise sparse reward function where the robot is only rewarded when it places a block within a threshold distance of its desired goal location. The overall reward for placing the blocks combines these per-block terms. We noticed that with this reward function, the robot learns to hold the top two blocks in its gripper instead of placing them and moving its hand away. To discourage this behavior, we added an additional term to the reward function to encourage the robot to move its hand away from the tower. This additional term was only provided when the hand was farther than a set distance from a “fully-stacked” tower. Following prior work, we set the distance threshold to 5 cm, the size of each block.
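A step-wise sparse reward of this kind might be sketched as follows. This is an assumption-laden illustration: it counts the blocks that lie within the 5 cm threshold of their goals, whereas the exact expression used in the paper (for example its sign, offset, or the hand-away term) may differ. The function name is hypothetical.

```python
import math

BLOCK_SIZE = 0.05  # metres; the text sets the threshold to the block size (5 cm)

def stepwise_sparse_reward(block_positions, goal_positions, threshold=BLOCK_SIZE):
    """Assumed form: one unit of reward per block within `threshold` of its goal.

    No shaping signal is given for blocks that are merely close to the
    threshold, which is what makes the reward sparse.
    """
    return sum(
        1.0
        for pos, goal in zip(block_positions, goal_positions)
        if math.dist(pos, goal) < threshold
    )
```

For two blocks where only the first sits on its goal, this returns 1.0, so the agent receives a signal for each completed step of the stack rather than only at the end.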
A typical RL agent acts within an environment $E$, modeled by a discrete-time Markov Decision Process (MDP) described by a state space $\mathcal{S}$, action space $\mathcal{A}$, transition function $\mathcal{T}$, reward function $r$, and discount factor $\gamma$. The aim of the agent is to maximize the expected cumulative reward along states $s_t$ caused by a sequence of actions $a_t$, by learning a suitable policy $\pi$, i.e. $\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t} \gamma^{t} r(s_t, a_t)\right]$.
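A relatively efficient class of policy search algorithms is off-policy reinforcement learning. Q-learning is a well-known choice for off-policy learning, wherein the aim is to model the Q-function, i.e. $Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{t' \geq t} \gamma^{t'-t} r(s_{t'}, a_{t'})\right]$. In principle, the optimal Q-function is found by solving the Bellman equation $Q^{*}(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[r(s_t, a_t) + \gamma \max_{a} Q^{*}(s_{t+1}, a)\right]$. In practice, we approximate the Q-function with a function approximator (i.e. a neural network) $Q_{\theta}$ parameterized by $\theta$ by minimizing the Bellman error $\mathcal{L}(\theta) = \mathbb{E}\left[\left(Q_{\theta}(s_t, a_t) - \left(r(s_t, a_t) + \gamma \max_{a} Q_{\bar{\theta}}(s_{t+1}, a)\right)\right)^{2}\right]$, where $\bar{\theta}$ is held constant during optimization and represents the weights of a slowly-updated “target” network.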
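To make the Bellman error concrete, here is a small sketch assuming NumPy and a finite set of actions, so that the maximum over next actions is a simple array reduction. With continuous actions, as in actor-critic methods, that maximum is typically replaced by evaluating the actor's action instead. The function names and the `dones` terminal mask are assumptions of this example.

```python
import numpy as np

def bellman_targets(rewards, next_q_target, dones, gamma=0.99):
    """Compute y = r + gamma * max_a Q_target(s', a), with no bootstrap
    on terminal transitions.

    rewards:        (B,) rewards for a batch of transitions
    next_q_target:  (B, A) target-network Q-values at the next states
    dones:          (B,) 1.0 where the episode ended, else 0.0
    """
    bootstrap = next_q_target.max(axis=1)
    return rewards + gamma * (1.0 - dones) * bootstrap

def bellman_error(q_pred, targets):
    """Mean squared difference between predicted Q-values and targets."""
    return float(np.mean((q_pred - targets) ** 2))
```

The targets are computed from the slowly-updated target network and treated as constants, so only the prediction `q_pred` would receive gradients in an actual training loop.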
While the above formulation is appropriate for a single goal, for solving multiple tasks, it is necessary to provide a task description as input [49, 1, 3]. Goal-conditioned policies are expressed as $\pi(a_t \mid s_t, g)$, where $g$ represents the goal state. The learning problem is expressed as:
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{g \sim p(g)}\, \mathbb{E}_{\pi}\left[\sum_{t} \gamma^{t} r(s_t, a_t, g)\right]$$
where the goal $g$ is sampled from a goal distribution $p(g)$.
The central computation in a GNN is message passing between 1-hop vertices of a graph, performed by a graph-to-graph module. This module takes as input a variable-size vertex set $\{v_i\}_{i=1}^{N}$ and outputs an updated set $\{v'_i\}_{i=1}^{N}$, where $N$ is the number of vertices in the input graph. $v_i$ and $v'_i$ denote the feature vectors of the $i$-th node before and after a round of message passing. In each message passing round, each vertex sends a message to every other vertex. In attention-based GNNs, the incoming messages are weighted by a scalar coefficient (computed by attention) according to their relevance to the receiving vertex. The new feature representation of the vertex is the weighted sum of incoming messages. Message passing is typically performed multiple times. After message passing, the entire graph is represented as a fixed-size embedding by pooling features across all vertices.
Mathematically, let $v_i$ be the feature representation of the $i$-th vertex at the start of a message passing round. In every round, each vertex generates a query $q_i = f_q(v_i)$, a key $k_i = f_k(v_i)$ and a message $m_i = f_m(v_i)$ using the independently-parameterized functions $f_q$, $f_k$, and $f_m$. Each vertex in the graph receives a message from all the vertices and computes its new feature representation, $v'_i = \sum_{j} \alpha_{ij} m_j$, where $\alpha_{ij}$ are the attention weights, computed by normalizing a relevance score between the query $q_i$ and each key $k_j$ over all sending vertices $j$.
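One round of attention-weighted message passing, using the query, key and message described above, could be sketched as follows, assuming NumPy. Two choices here are assumptions of this example rather than details from the text: the query, key and message functions are plain linear maps, and the relevance score is a scaled dot product normalized with a softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def message_passing_round(h, w_q, w_k, w_m):
    """One attention-weighted message passing round on a fully connected graph.

    h:             (N, d) vertex features, N may vary between calls
    w_q, w_k, w_m: (d, d) weights of the query, key and message maps
    Returns the updated (N, d) features and the (N, N) attention matrix.
    """
    q, k, m = h @ w_q, h @ w_k, h @ w_m
    # Assumed scoring rule: scaled dot product between queries and keys.
    scores = q @ k.T / np.sqrt(k.shape[1])
    # Row i holds the weights receiver i assigns to every sender j.
    alpha = softmax(scores, axis=1)
    # New features are the attention-weighted sum of incoming messages.
    return alpha @ m, alpha
```

Because the weights act on each vertex independently and the sum runs over however many vertices are present, the same parameters apply unchanged to graphs with 2 or 6 vertices, which is the property that allows one network to handle a varying number of objects.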
We present a simple but effective method for solving long-horizon, sparse reward tasks using reinforcement learning. Our core contribution is to equip the RL agent with inductive biases of relational reasoning in order to enable learning from a curriculum of tasks of increasing complexity. We use Soft Actor-Critic (SAC) as our base learning algorithm because it is more robust to the choice of hyperparameters and random seeds than alternative off-policy learners such as DDPG. To use the same policy for multiple tasks, we modified SAC to be goal-conditioned [49, 1, 3]. For better sample efficiency, we also incorporated the idea of goal re-labelling via hindsight experience replay (HER). Details of SAC and HER can be found in the respective papers and are not directly relevant to our work. While we use SAC + HER for policy learning, our contributions are not specific to these algorithms and are applicable to any policy learning method.
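The goal re-labelling idea behind HER can be illustrated with a short sketch. It assumes the commonly used "future" strategy, in which each transition is replayed with goals that were actually achieved later in the same episode; the dictionary keys, the function name and the value of `k` are hypothetical and not taken from the paper.

```python
import random

def her_relabel(episode, reward_fn, k=4, rng=None):
    """Create relabelled copies of each transition in `episode`.

    episode:   list of dicts with keys "achieved_goal" (the goal reached
               after the step) and "goal" (the goal originally commanded)
    reward_fn: callable (achieved_goal, goal) -> reward
    k:         number of relabelled copies per transition
    """
    rng = rng or random.Random(0)
    relabelled = []
    for t, step in enumerate(episode):
        for _ in range(k):
            # Pick a goal that was achieved at this step or later.
            future = rng.randrange(t, len(episode))
            new_goal = episode[future]["achieved_goal"]
            relabelled.append({
                **step,
                "goal": new_goal,
                "reward": reward_fn(step["achieved_goal"], new_goal),
            })
    return relabelled
```

Relabelling turns episodes that failed under the original goal into successful examples for some other goal, so a sparse reward still yields a learning signal.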
We represent both the actor and critic in SAC using the graph neural network architecture described in Section IV-C. The various components of the GNN (the query, key and message functions) use 64D linear layers. We use separate weights for each round of message passing and terminate the message passing after 3 rounds. To ease optimization, we use a residual connection and layer normalization between the output of one message passing round and the input of the next. We call this agent architecture ReNN. We compare the performance of ReNN against a baseline system that constructs the actor and critic using four 256D fully connected layers (referred to as MLP in the rest of the paper).
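The stacking of rounds described above might look like the following sketch, assuming NumPy. Several details are assumptions of this example: the residual is added before layer normalization, layer normalization has no learned scale or shift, attention uses a scaled dot-product softmax, and the final pooling is a mean over vertices. All function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vertex feature vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention_round(h, params):
    """Attention-weighted message passing with linear query/key/message maps."""
    q, k, m = h @ params["q"], h @ params["k"], h @ params["m"]
    scores = q @ k.T / np.sqrt(h.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ m

def encode_graph(h, rounds):
    """Apply each round with its own weights, then pool to a fixed-size vector."""
    for params in rounds:
        # Residual connection followed by layer normalization (assumed order).
        h = layer_norm(h + attention_round(h, params))
    return h.mean(axis=0)  # mean pooling across vertices (assumed)

def init_rounds(dim=64, n_rounds=3, seed=0):
    """Random 64-D linear weights, one independent set per round."""
    rng = np.random.default_rng(seed)
    return [
        {name: rng.normal(scale=dim ** -0.5, size=(dim, dim)) for name in ("q", "k", "m")}
        for _ in range(n_rounds)
    ]
```

Calling `encode_graph` with 2, 6 or 9 vertex rows returns a 64-D embedding in every case, so the downstream actor and critic heads see an input of constant size regardless of the number of blocks.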
Training Curriculum: We trained the robot to stack multiple blocks using three different curricula of tasks:
Direct: The robot was directly tasked to learn a policy to stack six blocks starting from scratch.
Uniform: At every episode, the number of blocks was uniformly sampled between 1 and 6.
Sequential: The robot was tasked to first pick and place a single block at goal positions that were uniformly and randomly chosen to be on the table or in the air. The robot then had to pick and place 2 blocks, where goal position of one block was sampled on the table and the goal position for the second block was sampled using the process described above. Thereafter, the robot was tasked with stacking blocks in a single tower configuration starting with 2 blocks. After the robot perfected stacking (N-1) blocks, it was given N blocks to stack. N was sequentially increased from 3 to 6. The transition points in this curriculum were manually chosen based on the success rates on stacking.
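The sequential curriculum can be sketched as a small scheduler. Note the assumption: in the text, the transition points were chosen manually based on success rates, whereas this illustration advances automatically once the success rate over a sliding window reaches a threshold. The class name, stage labels, threshold and window size are all hypothetical.

```python
from collections import deque

class SequentialCurriculum:
    """Advance through an ordered list of stages as each one is mastered."""

    def __init__(self, stages, threshold=0.9, window=100):
        self.stages = list(stages)
        self.threshold = threshold
        self.index = 0
        self.recent = deque(maxlen=window)  # most recent episode outcomes

    @property
    def current(self):
        return self.stages[self.index]

    def record(self, success):
        """Log one episode outcome and return the stage for the next episode."""
        self.recent.append(bool(success))
        window_full = len(self.recent) == self.recent.maxlen
        mastered = window_full and sum(self.recent) / len(self.recent) >= self.threshold
        if mastered and self.index < len(self.stages) - 1:
            self.index += 1
            self.recent.clear()  # start measuring the new stage from scratch
        return self.current
```

A curriculum matching the description above would list the two pick-and-place stages first, followed by stacking 2 through 6 blocks, for example `SequentialCurriculum(["pick_place_1", "pick_place_2", "stack_2", "stack_3", "stack_4", "stack_5", "stack_6"])`.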
We evaluated the generalization of the policy trained for stacking a single tower by evaluating its performance on the following tests (see Appendix for visuals):
Single Tower: A single point was uniformly sampled on the table to serve as the base of a block tower. The goal positions of the blocks corresponded to translation along the z-axis from the base.
Multiple Towers: A few points were sampled on the table to serve as the base locations of multiple towers. Each block was randomly assigned to a tower to produce towers of approximately equal height.
Pyramid: A uniformly sampled point on the table served as a corner point for pyramid configuration. Figure A.1 shows different Multiple Towers and Pyramid goal configurations for varying number of blocks.
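The three goal configurations can be illustrated with simple goal generators. This is a sketch under stated assumptions: goals are block centres, the table surface sits at `table_z`, blocks within a tower are spaced by exactly one block size, and the pyramid is a planar arrangement in which each row has one block fewer than the row below and is shifted by half a block. The exact layouts used in the paper may differ, and all function names are hypothetical.

```python
import random

BLOCK = 0.05  # block side length in metres (5 cm cubes)

def single_tower_goals(base_xy, n_blocks, table_z=0.0):
    """Goals stacked along the z-axis above a single base point."""
    x, y = base_xy
    return [(x, y, table_z + BLOCK * (i + 0.5)) for i in range(n_blocks)]

def multi_tower_goals(bases_xy, n_blocks, table_z=0.0, rng=None):
    """Randomly assign blocks to towers so that heights differ by at most one."""
    rng = rng or random.Random(0)
    order = list(range(n_blocks))
    rng.shuffle(order)
    goals = [None] * n_blocks
    for slot, block in enumerate(order):
        tower, level = slot % len(bases_xy), slot // len(bases_xy)
        x, y = bases_xy[tower]
        goals[block] = (x, y, table_z + BLOCK * (level + 0.5))
    return goals

def pyramid_goals(corner_xy, n_blocks, table_z=0.0):
    """Planar pyramid starting from a corner point (assumed layout)."""
    width = 1
    while width * (width + 1) // 2 < n_blocks:
        width += 1  # smallest base row that can hold all blocks
    x0, y0 = corner_xy
    goals = []
    for level in range(width):
        for col in range(width - level):
            if len(goals) == n_blocks:
                return goals
            x = x0 + BLOCK * (col + 0.5 * level)  # half-block shift per level
            goals.append((x, y0, table_z + BLOCK * (level + 0.5)))
    return goals
```

In the pyramid layout, every block above the base row rests on two supporting blocks, which is the property that distinguishes it from purely vertical stacking.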
We report the performance of ReNN-Sequential (referred to as ReNN in later text) across three seeds. For other methods we report performance on a single seed. Success rate is reported as the fraction of successfully completed episodes, averaged over 100 episodes. An episode is counted as successful when each block is at its goal position at the final time step.
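Table I: Success rate, with the number of training timesteps in parentheses.
| Method | Single Tower 4 | Single Tower 5 | Single Tower 6 |
|---|---|---|---|
| Nair '17 | 91% (850M) | 50% (1000M) | 32% (2300M) |
| Ours | 93% ± 4% (23M) | 84% ± 6% (27M) | 75% ± 4% (30M) |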
Figure 1 shows that ReNN trained with the sequential curriculum (green line; Section V) succeeds at stacking six blocks into a tower. Standard MLP architectures, or ReNN trained to directly stack 6 blocks without the curriculum, fail. Our experiments revealed that training with the uniform curriculum was also insufficient. These results show that both ReNN and the sequential training are critical for success. To the best of our knowledge, ours is the first paper to show that it is possible to train an RL agent to stack six or more blocks in a tower from scratch, without requiring expert demonstrations.
We report the quantitative performance of our method and baselines in Table I. Our method achieves a success rate of 75% at stacking 6 blocks in 30 million timesteps. In comparison, the existing state-of-the-art method (Nair '17 in Table I), which makes use of human demonstrations and resets, achieves a success rate of only 32% after over 2.3 billion timesteps. While the base learning algorithm used by that method is DDPG + HER, in contrast to the SAC + HER used by us, the orders-of-magnitude difference in performance cannot be attributed to the choice of SAC over DDPG. We attempted to replicate its results using SAC. However, we were unsuccessful at training SAC with behavior cloning due to the challenge of weighing the entropy term in SAC against the behavior cloning loss.
Careful analysis of Figure 1 reveals that there are several dips in performance as the training progresses. Many of the significant dips correspond to an increase in task complexity, from stacking N blocks to stacking N+1 blocks. In most cases, the dip in performance is overcome after little additional experience. The only notable exception is the performance dip at 9M steps that corresponds to transitioning from 1 to 2 blocks. This was the first time the agent observed multiple objects. Additionally, we found that SAC converged faster, albeit with higher variance, when its exploration was augmented to take a random action with a probability of 0.1.
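The exploration augmentation mentioned above, taking a random action with probability 0.1, could be wrapped around any policy as in this sketch. The function name and the uniform sampling within per-dimension action bounds are assumptions of this example.

```python
import random

def explore(policy_action, action_low, action_high, epsilon=0.1, rng=random):
    """With probability `epsilon`, replace the policy's action with a
    uniformly random one drawn within the given per-dimension bounds."""
    if rng.random() < epsilon:
        return [rng.uniform(lo, hi) for lo, hi in zip(action_low, action_high)]
    return list(policy_action)
```

For the 4D action space described earlier, `action_low` and `action_high` would each hold four entries, one per end-effector displacement axis plus one for the gripper opening.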
It is desirable to learn policies that are not only adept at the task they were trained on, but can also be re-purposed for new and related tasks. If our ReNN architecture indeed provides a good inductive bias, then it should be possible to solve different block configuration tasks with high accuracy. To test this, we evaluated the performance of the learned policy, without any fine-tuning, on the previously unseen block configurations described in Section V-A (i.e., zero-shot generalization). The results of this analysis are summarized in Figure 2.
Single Tower Evaluation: Figure 2 shows that a policy trained to stack a given number of blocks generalizes to stacking a larger number of blocks without any additional training. However, performance drops significantly once the tower becomes sufficiently taller than those seen during training. One possible explanation is that it becomes progressively harder to stabilize a larger number of blocks in a tower, and the robot needs to substantially refine its strategy to stack more blocks. An analysis of failure modes is presented in Appendix B.
Multiple Towers Evaluation: The previous experiments tested generalization to a larger number of blocks, but on the same task. To test if the learned policy generalizes to new tasks, we evaluated the performance on stacking multiple towers instead of a single tower. Results in Figure 2(b) show that the agent trained for stacking a single tower of blocks can successfully stack multiple towers of blocks. The performance again drops as the number of blocks increases. However, generalization to larger numbers of blocks is better on the multiple towers task than on the single tower task. This suggests that while ReNN can generalize to a larger number of blocks than seen during training, stacking a taller single tower without additional training is hard due to the difficulty of stabilizing a taller stack of blocks.
Pyramid Evaluation: To stress test our system further, we evaluated its performance on placing blocks in a pyramid configuration (see Figure A.1). Note that the robot never saw pyramids during training. Stacking blocks in a pyramid is different from stacking a tower, because blocks may now need to be balanced on two supporting blocks instead of only being stacked vertically. Figure 2(c) shows that our system is able to generalize and manipulate a larger number of blocks than seen in training into pyramid configurations. Interestingly, the agent trained on Single Tower 4 performs better on the difficult Pyramid 5 and Pyramid 6 tasks than the agent trained on Single Tower 6. One possible explanation is that the agent trained on the taller tower overfits to stacking blocks vertically, and is less able to stack blocks at a horizontal offset, which is useful for the pyramid task.
Emergent Strategies: The accompanying videos (https://richardrl.github.io/relational-rl) show that our agent automatically learns to push other blocks to grasp a particular block, grasps two blocks at a time and places them one by one to save time and other complex behaviors. These strategies emerge automatically as a consequence of optimizing a sparse reward function.
To the best of our knowledge, ours is the first work that reports such zero-shot generalization on the block stacking task using RL. At the same time, we acknowledge, there is substantial room for improving the zero-shot results and the stacking performance. Some future directions are described in Section VII.
In order to gain insights into why ReNN leads to faster convergence and better generalization, we visualized the attention patterns as the robot stacked six blocks (see Figure 3). The first row shows the key steps in tower stacking. Each column in the second row is a matrix of attention weights. Each entry (i, j) in the matrix represents the normalized relevance score of block j on the features of block i (see the attention weights defined in Section IV-C), computed by the final layer. It can be seen that maximum attention is paid to the block that is to be placed, and that the attention on existing blocks in the stack decreases from the top-most to the bottom-most block. Such an attention pattern suggests that our system has learned to focus on the blocks most relevant to the block currently being placed. This is interesting because it suggests that ReNN has learned to decompose a complex problem into simpler sub-problems. We hypothesize that such decomposition is the reason why our system can learn from a task curriculum and exhibits zero-shot generalization.
We have presented a framework for learning long-horizon, sparse reward tasks using deep reinforcement learning, a relational graph architecture and curriculum learning. While we are orders of magnitude more sample efficient than the existing state-of-the-art, our method would still require a few dozen robots (corresponding to our 35 workers) and several days (assuming each action takes 0.25 seconds) of real world training to collect a comparable number of environment steps. And while block stacking is representative of long-horizon, multi-object manipulation tasks, it is important to scale our method to tasks involving more complicated object geometries and more granular manipulation.
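The real-world training estimate can be checked with a short calculation. It uses the 30 million timesteps reported for stacking six blocks in Table I, 0.25 seconds per action and 35 robots, and assumes perfectly parallel data collection with no time spent on resets or policy updates. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400

def wall_clock_days(total_steps, seconds_per_step, n_robots):
    """Days of continuous operation if steps are split evenly across robots."""
    return total_steps * seconds_per_step / n_robots / SECONDS_PER_DAY

days = wall_clock_days(total_steps=30e6, seconds_per_step=0.25, n_robots=35)
print(f"{days:.2f} days")  # about 2.48 days under these assumptions
```

This is consistent with the "several days" figure quoted above; any overhead from resets or hardware downtime would push the real number higher.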
In the current work, the curriculum is manually designed and based on the principle that smaller sets of objects are easier to learn to manipulate than larger sets of objects. However, more complicated and effective curricula could exist along axes of variation beyond just the object cardinality, and discovering these curricula automatically is an interesting direction for future research. One point of concern with relational architectures is that the computation time is quadratic in the number of entities. Developing computationally efficient methods is therefore important to scale these methods to environments with much larger numbers of objects. Finally, while we have presented results from state observation, in the future we would like to scale our system to work from visual and other sensory observations.
We acknowledge support from US Department of Defense, DARPA’s Machine Common Sense Grant and the BAIR and BDD industrial consortia. We thank Amazon Web Services (AWS) for their generous support in the form of cloud credits. We’d like to thank Vitchyr Pong, Kristian Hartikainen, Ashvin Nair and other members of the BAIR lab and the Improbable AI lab for helpful discussions during this project.
One-shot imitation learning. In Advances in Neural Information Processing Systems, pp. 1087–1098.
Thirty-Second AAAI Conference on Artificial Intelligence.
Exploration in model-based reinforcement learning by empirically estimating learning progress. In NIPS.