Learning Manipulation Skills Via Hierarchical Spatial Attention

04/19/2019 · by Marcus Gualtieri, et al. · Northeastern University

Learning generalizable skills in robotic manipulation has long been challenging due to real-world sized observation and action spaces. One method for addressing this problem is attention focus – the robot learns where to direct its sensors, and irrelevant details are ignored. However, such methods have largely not caught on due to the difficulty of learning a good attention policy and the added partial observability induced by a narrowed window of focus. This article addresses the first issue by constraining gazes to a spatial hierarchy. For the second issue, we identify a case where the partial observability induced by attention does not prevent Q-learning from finding an optimal policy. We conclude with real-robot experiments on challenging pick-place tasks demonstrating the applicability of the approach.




I Introduction

Learning robotic manipulation has remained an active and challenging research area. This is because the real-world environments in which robots exist are large, dynamic, and complex. Partial observability – where the robot does not at once perceive the entire environment – is common and requires reasoning over past perceptions. Additionally, the ability to generalize to new situations is critical because, in the real world, new objects can appear in different places unexpectedly.

The particular problem addressed in this paper is the large space of possible robot observations and actions – how the robot processes its past and current perceptions to make high-dimensional decisions. Visual attention has long been suggested as a solution to this problem [1]. Focused perceptions can ignore irrelevant details, and generalization is improved by the elimination of the many irrelevant combinations of object arrangements [1]. Additionally, as we later show, attention can substantially reduce the number of actions that need to be considered. Indeed, when selecting position, the number of action choices can become logarithmic rather than linear in the volume of the robot's workspace. In spite of these benefits, visual attention has largely not caught on due to (a) the additional burden of learning where to attend and (b) the additional partial observability caused by the narrowed focus.

We address the first challenge – efficiently learning where to attend – by constraining the system to a spatial hierarchy of attention. On a high level this means the robot must first see a large part of the scene in low detail, select a position within that observation, and see the next observation in more detail at the position previously selected, and so on for a fixed number of gazes. We address the second challenge – partial observability induced by the narrowed focus – by identifying attention with a type of state-abstraction which preserves the ability to learn optimal policies with efficient reinforcement learning (RL) algorithms.

This article extends our prior work [2], wherein we introduced the hierarchical spatial attention (HSA) approach and demonstrated it on 3 challenging, 6-DoF, pick-place tasks. New additions include (a) faster training and inference times, (b) more ablation studies and comparisons to related work, (c) better understanding of when an optimal policy can be learned when using this approach, (d) longer time horizons, and (e) improved real-robot experimental results.

The rest of the paper is organized as follows. First is related work (Section II). Next, the general manipulation problem is described and the visual attention aspect is added (Sections III and IV-A). After that, the HSA constraints are added, and this approach is viewed as a generalization of earlier approaches (Section IV-B to IV-D). The bulk of the paper includes analysis and comparisons in 5 domains of increasing complexity (Section V). Real robot experiments are described close to the end (Sections V-D and V-E). Finally, we conclude with what we learned and future directions (Section VI).

II Related Work

This work is most related to robotic manipulation, reinforcement learning, and attention models. It extends our prior research on 6-DoF pick-place [2] and primarily builds on DQN [3] and Deictic Image Mapping [4].

II-A Learning Robotic Manipulation

Traditional approaches to robotic manipulation consider known objects – a model of every object to be manipulated is provided in advance [5, 6]. While these systems can be quite robust in controlled environments, they encounter failures when the shapes of the objects differ from what is expected. Recent work has demonstrated grasping of novel objects by employing techniques intended to address the problem of generalization in machine learning [7, 8, 9, 10, 11, 12, 13, 14, 15].

There have been attempts to extend novel object grasping to more complex tasks such as pick-place. However, these have assumed either fixed grasp choices [16] or fixed place choices [17]. The objective of the present work is to generalize these attempts – a single system that can find 6-DoF grasp and place poses.

Research contemporary with ours considers grasping and pushing novel objects to a target location [18]. Their approach is quite different: a predictive model of the environment is learned and used for planning, whereas we aim to learn a policy directly. Other work has considered the problem of domain transfer [19] and sparse rewards in RL [20]. We view these as complementary ideas that could be combined with our approach for further improvement.

II-B Reinforcement Learning

Like several others, we apply RL techniques to the problem of robotic manipulation (see above-mentioned [9, 12, 14, 17, 20] and survey [21]). RL is appealing for robotic control for several reasons. First, several algorithms (e.g., [22, 23]) do not require a complete model of the environment. This is of particular relevance to robotics, where the environment is dynamic and difficult to describe exactly. Additionally, observations are often encoded as camera or depth sensor images. Deep Q-Networks (DQN) demonstrated an agent learning difficult tasks (Atari games) where observations were image sequences and actions were discrete [3]. An alternative to DQN that can handle continuous action spaces is the family of actor-critic methods, such as DDPG [24]. Finally, RL – which has its roots in optimal control – provides tools for the analysis of learning optimal behavior (e.g. [25, 26, 27]), which we refer to in Section V-A.

II-C Attention Models

Our approach is inspired by models of visual attention. Following the early work of Whitehead and Ballard [1], we distinguish overt actions (which directly effect change in the environment) from perceptual actions (which retrieve information). Similar to their agent model, our abstract robot has a virtual sensor which can be used to focus attention on task-relevant parts of the scene. The present work updates their methodology to address more realistic problems, and we extend their analysis by describing a situation where an optimal policy can be learned even in the presence of “perceptual aliasing” (i.e. partial observability).

Attention mechanisms have also been used with artificial neural networks to identify an object of interest in a 2D image [28, 29, 30, 31]. Our situation is more complex in that we identify 6-DoF poses of the robot’s hand. Improved grasp performance has been observed by active control of the robot’s sensor [32, 33]. These methods attempt to identify the best sensor placement for grasp success. In contrast, our robot learns to control a virtual sensor for the purpose of reducing the complexity of action selection and learning.

III Problem Statement

Problems considered herein can be described as learning to control a move-effect system (Fig. 1, cf. [4]):

Definition 1 (Move-Effect System).

A move-effect system is a discrete time system consisting of a robot, equipped with a depth sensor and end effector (e.e.), and rigid objects of various shapes and configurations. The robot perceives a history of point clouds , where is acquired by the depth sensor; an e.e. status, ; and a reward . The robot’s action is , where is the pose of the e.e., is the robot’s workspace, and is a preprogrammed controller for the e.e. For each stage , the robot receives a new perception and takes an action.

The reward is usually instrumented by the system engineer to indicate progress toward completion of some desired task. The robot initially has no knowledge of the system’s state transition dynamics. The objective is, by experiencing a sequence of episodes, for the robot to learn a policy – a mapping from observations to actions – which maximizes the expected sum of per-episode rewards.

Fig. 1: The move-effect system. The robot has an e.e. which can be moved to pose to perform operation .

For example, suppose the e.e. is a 2-fingered gripper, , , the objects are bottles and coasters, and the task is to place all the bottles on the coasters. The reward could be for placing a bottle on a coaster, for removing a placed bottle, and otherwise.
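As a concrete illustration, the instrumented reward for this example might be encoded as follows. This is a hypothetical sketch: the event names and the values +1, -1, and 0 are assumptions for illustration, not a specification from the text.

```python
# Hypothetical reward signal for the bottles-on-coasters example.
# Event names and values (+1 place, -1 remove, 0 otherwise) are assumptions.
def reward(event: str) -> float:
    if event == "bottle_placed_on_coaster":
        return 1.0
    if event == "placed_bottle_removed":
        return -1.0
    return 0.0
```
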

IV Approach

Our approach has two parts. The first part is to reformulate the problem as a Markov decision process (MDP) with abstract states and actions (Section IV-A). With this reformulation, the resulting state representation is substantially reduced, and it becomes possible for the robot to learn to restrict attention to task-relevant parts of the scene. The second part is to add constraints to the actions such that the e.e. pose is decided sequentially (Section IV-B). With these improvements, the problem becomes amenable to solution via standard RL algorithms like DQN (examples in Section V).

IV-A Sense-Move-Effect MDP

The sense-move-effect system adds a controllable, virtual sensor which perceives a portion of the point cloud from a parameterizable perspective (Fig. 2).

Definition 2 (Sense-Move-Effect System).

A sense-move-effect system is a move-effect system where the robot’s actions are augmented with (where and ) and the point cloud observations are replaced with a history of images, (where ). The action has the effect of adding to the history.¹ (Footnote 1: is orthographic projections of points onto , images. Each image plane is positioned at the origin with a different orientation. Image values are the point-to-plane distance, with ambiguities resolved by the nearest distance. returns the points of which lie inside a rectangular volume situated at the origin with length, width, height . expresses (initially expressed w.r.t. the world frame) w.r.t. .)

Fig. 2: The sense-move-effect system adds a virtual, mobile sensor which observes points in a rectangular volume at pose with size .

The action makes it possible for the robot to get either a compact overview of the scene or to attend to a small part of the scene in detail. Since the resolution of the images is fixed, large values of correspond to seeing more objects in less detail, and small values of correspond to seeing fewer objects in more detail.
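The virtual sensor's image formation can be sketched as an orthographic height-map renderer: crop the cloud to a box at the sensor pose, then project the cropped points onto an image plane, keeping the nearest point-to-plane distance per pixel. This is a minimal numpy sketch under simplifying assumptions (a single image plane, a 4x4 homogeneous pose matrix, and hypothetical function names):

```python
import numpy as np

# Sketch of the virtual sensor. A single image plane, 4x4 homogeneous
# poses, and the function names are assumptions for illustration.
def crop(points, pose_inv, size):
    """Return points (expressed in the sensor frame) inside a box of `size`."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    local = (pose_inv @ homog.T).T[:, :3]
    half = np.asarray(size) / 2.0
    return local[np.all(np.abs(local) <= half, axis=1)]

def height_map(local_points, size, res=64):
    """Orthographic projection onto the x-y plane; nearest distance wins."""
    img = np.full((res, res), float(size[2]))  # background = far plane
    half = np.asarray(size) / 2.0
    for x, y, z in local_points:
        u = int((x + half[0]) / size[0] * (res - 1))
        v = int((y + half[1]) / size[1] * (res - 1))
        img[v, u] = min(img[v, u], z + half[2])  # resolve ambiguity: nearest
    return img
```
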

The robot’s memory need not include the last images – it can include any previous images selected according to a predetermined strategy. Because the environment only changes after actions, we keep the latest image, , and the last images that appeared just before actions. Fig. 3 shows an example in the bottles on coasters domain.
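This memory strategy can be sketched in a few lines: keep the latest sense image plus the image that appeared just before each overt action. The class and method names below are hypothetical.

```python
# Sketch of the memory strategy described above. Since the environment
# only changes after overt actions, the image preceding each overt
# action summarizes the scene at that time. Names are hypothetical.
class ImageMemory:
    def __init__(self):
        self.pre_overt = []  # images saved just before overt actions
        self.latest = None   # most recent sense image

    def sense(self, image):
        self.latest = image

    def overt(self):
        self.pre_overt.append(self.latest)

    def observation(self):
        return self.pre_overt + [self.latest]
```
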

Fig. 3: Scene and observed images for and . Left. Scene’s initial appearance. Left center. image (large ) just before . Right center. Scene’s current appearance. Right. Current image, focused on the coaster (small ).

In order to apply standard RL algorithms to the problem of learning to control a sense-move-effect system, we define the sense-move-effect MDP.

Definition 3 (Sense-Move-Effect MDP).

Given a sense-move-effect system, a reward function, and transition dynamics, a sense-move-effect MDP is a finite horizon MDP where states are sense-move-effect system observations and actions are sense-move-effect system actions.

The reward function and transition details are task and domain specific, respectively, examples of which are given in Section V.

IV-B Hierarchical Spatial Attention

The observation is now similar to that of DQN – a short history of images plus the e.e. status – and can be used by a Q-network to approximate Q-values. However, the action space remains large due to the 6-DoF choice for or and the 3-DoF choice for . Additionally, it may take a long time for the robot to learn which actions result in useful observations. To remedy both issues, we design constraints to the sense-move-effect actions.

Definition 4 (Hierarchical Spatial Attention).

Given a sense-move-effect system, , , and the list of pairs , (where and ), hierarchical spatial attention (HSA) constrains the robot to take actions, with for , prior to each action. Furthermore, the first sensor pose in this sequence must be ; the sensor poses , for , must be offset no more than from ; and the e.e. pose must be offset no more than of .² (Footnote 2: Concretely, indicates a position offset of , , and , and a rotation offset of , , and .)

The process is thus divided into overt stages, where, for each stage, actions are followed by 1 action (Fig. 4). The constraints should be set such that the observation size and offset decrease as increases, so the point cloud under observation shrinks, and the volume within which the e.e. pose can be selected also shrinks. These constraints are called hierarchical spatial attention because the robot is forced to learn to attend to a small part of the scene (e.g., Fig. 5).
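The stage structure can be summarized in a few lines of code: a sequence of sense actions refines the point of attention, then one overt action is taken at the final pose. In this sketch, `select` is a hypothetical stand-in for the learned greedy policy.

```python
# Sketch of the HSA control flow within one overt stage.
# `select` is a hypothetical stand-in for the learned policy.
def hsa_stage(q0, deltas, select):
    """q0: initial focus position; deltas: max offset allowed per level."""
    q = q0
    for level, delta in enumerate(deltas):
        offset = select(level, q, delta)  # policy picks an offset
        assert all(abs(o) <= d for o, d in zip(offset, delta))
        q = tuple(qi + oi for qi, oi in zip(q, offset))
    return q  # pose used for the overt move-effect action
```
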

Fig. 4: Initially, the state is empty. Then, sense actions are taken; after each, the latest image is the state. After this, the robot takes 1 action. Then the process repeats, but with the last image before saved to memory.
Fig. 5: HSA applied to grasping in the bottles on coasters domain (Section V-D). There are 4 levels (i.e. ). The sensor’s volume size is cm for level 1, cm for levels 2 and 3, and cm for level 4. As indicated by blue squares, constrains position in the range of cm for level 1, cm for level 2, and cm for level 3. Orientation is selected for level 4 in the range of about the hand approach axis. Red crosses indicate the position selected by the robot, and the red circle indicates the angle selected by the robot. Positions are sampled uniformly on a grid and 60 orientations are uniformly sampled. Pixel values normalized and height selection not shown for improved visualization.

To see how HSA can improve action sample efficiency, consider the problem of selecting position in a 3D volume. Let be the largest volume allowed per sample. With naive sampling, the required number of samples is proportional to the workspace volume , i.e., . But with HSA, we select position sequentially, by say, halving the volume size in each direction at each step, i.e., . In this case samples are needed – 1 for each part of the divided space at each step, and the volume under focus is . To get , , so the sample complexity becomes logarithmic, rather than linear, in .
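The comparison can be made concrete with a short calculation, assuming 8 samples per level (one per octant) and the observed volume shrinking by a factor of 8 at each level (halved per dimension):

```python
import math

# Naive uniform position sampling vs. hierarchical refinement (HSA).
# Assumptions: 8 samples per level, volume / 8 per level.
def naive_samples(V, v):
    """Samples needed to cover workspace volume V at resolution v."""
    return math.ceil(V / v)

def hsa_samples(V, v):
    """Samples needed when refining hierarchically, 8 per level."""
    levels = 0
    vol = V
    while vol > v:
        vol /= 8.0
        levels += 1
    return 8 * levels
```

For example, at a resolution of 1/4096 of the workspace volume, naive sampling needs 4096 samples while hierarchical refinement needs 8 samples at each of 4 levels, i.e., 32.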

IV-C Lookahead Sense-Move-Effect

So far we have not specified how action parameters , , and are encoded. For standard sense-move-effect, these are typically encoded as 6 floating point numbers representing the pose and 3 floating point numbers representing the volume size . Alternatively, the pair could be encoded as the sense image that would be seen if the sensor were to move to pose with zoom . This is as if the action was “looking ahead” at the pose the sensor or e.e. would move to if this action were selected.

In particular, the lookahead sense-move-effect MDP has actions and , the difference being the additional parameter for . The action samples are encoded as the height map that would be generated by . Because action has this rich encoding, state is just the e.e. status and a history of actions.

The HSA constraints for the lookahead variant have the same parameterization – an initial pose and a list of pairs . The semantics are slightly different. for is the parameter for the th , and is the parameter. The for specify the offset of the action samples relative to the last pose decided, . specifies the offset of relative to .

IV-D Relation to Other Approaches in the Literature

IV-D1 DQN

Consider a sense-move-effect MDP with HSA constraints , centered in the robot’s workspace, and and large enough to capture the entire workspace. The only free action parameters for this system are the e.e. pose, which is sampled uniformly and spaced appropriately for the task, and the e.e. operation. In this case, the observations and actions are similar to that of DQN [3], and the DQN algorithm can be applied to the resulting MDP.

However, this approach is problematic in robotics because the required number of action samples is large, and the image resolution would need to be high in order to capture the required details of the scene. For example, a pick-place task where e.e. poses are in , the robot workspace is 1 m, the required position precision is 1 mm, and the required orientation resolution is 1 per Euler angle requires on the order of actions. Adding more levels (i.e. ) alleviates this problem.
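Under the stated assumptions (1 m workspace, 1 mm position precision, 1 degree per Euler angle), the single-shot action count can be tallied directly:

```python
# Back-of-the-envelope count of single-shot action samples for 6-DoF
# pick-place, under the assumptions stated above.
positions = 1000 ** 3    # (1 m / 1 mm) choices per axis, 3 axes
orientations = 360 ** 3  # 1-degree bins per Euler angle, 3 angles
actions = positions * orientations  # roughly 5e16 action samples
```
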

IV-D2 Deictic Image Mapping

With , centered in the robot’s workspace, the deictic marker size (e.g., the size of the largest object to be manipulated), and large enough to capture the entire workspace, HSA applied to the lookahead sense-move-effect MDP is the Deictic Image Mapping representation [4]. Similar to the case with DQN, if the space of e.e. poses is large, and precise positioning is needed, many actions need to be sampled. In fact, the computational burden with the Deictic Image Mapping representation is even larger than that of DQN due to the need to create images for each action. Yet, the deictic representation has significant advantages over DQN in terms of efficient learning due to its small observations [4].

HSA generalizes and improves upon both DQN and Deictic Image Mapping by overcoming the burden for the agent to select from many actions in a single time step. Instead, the agent sequentially refines its choice of e.e. pose over a sequence of decisions. We provide comparisons between these approaches in the following sections.

V Application Domains

In this section we compare the HSA approach in 5 application domains of increasing complexity. The complexity increases in terms of the size of the action space and the diversity of object poses and geometries. We analyze simpler domains because the results are more interpretable and learning is faster (Table I). More complex domains are included to demonstrate the practicality of the approach. All training is in simulation, and Sections V-D and V-E include test results for a physical robotic system.

Domain       | Tabular Pegs on Disks | Upright Pegs on Disks | Pegs on Disks | Bottles on Coasters | 6-DoF Pick-Place
Time (hours) | 0.23                  | 1.55                  | 3.36          | 8.12                | 96.54
TABLE I: Average simulation time for the 5 test domains. Times are averaged over 10 or more simulations on 4 different workstations, each equipped with an Intel Core i7 processor and an NVIDIA GTX 1080 graphics card.

V-A Tabular Pegs on Disks

Here we analyze the HSA approach applied to a simple, tabular domain, where the number of states and actions is finite. The domain consists of 2 types of objects – pegs and disks – which are situated on a 3D grid (Fig. 6). The robot can move its e.e. to a location on the grid and open/close its gripper. The goal is for the robot to place all the pegs onto disks.
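The overt dynamics of this domain can be sketched in a few lines. The class name, tuple-valued locations, and encoding below are assumptions for illustration; the grasp/place rules and reward values follow the ground MDP description given in this section.

```python
# Minimal sketch of the tabular pegs-on-disks overt dynamics.
# Locations are integer (x, y, z) tuples; names are hypothetical.
class PegsOnDisks:
    def __init__(self, pegs, disks):
        self.pegs = set(pegs)    # locations of pegs not in the e.e.
        self.disks = set(disks)  # disk locations (occupied iff a peg is there)
        self.held = False        # whether a peg is in the e.e.

    def step(self, loc):
        """Move the e.e. to loc and open/close; return the reward."""
        if not self.held and loc in self.pegs:
            r = -1.0 if loc in self.disks else 0.0  # -1: removed a placed peg
            self.pegs.remove(loc)                   # grasp
            self.held = True
            return r
        if self.held and loc not in self.pegs:
            self.pegs.add(loc)                      # place
            self.held = False
            return 1.0 if loc in self.disks else 0.0  # +1: placed on a disk
        return 0.0
```
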

Fig. 6: Tabular pegs on disks with an grid, 1 peg (red triangle), and 1 disk (blue circle).

If this problem is described as a finite MDP, eventual convergence to the optimal policy is guaranteed for standard RL algorithms [25, 26]. However, the number of state-action pairs is too large for practical implementation unless some abstraction is applied. The main question addressed here is if convergence guarantees are maintained with the HSA abstraction.

V-A1 Ground MDP

Tabular pegs on disks is first described without the sense-move-effect abstraction.


  • State. A set of pegs , a set of disks , and the current time . A peg (resp. disk) is a location except peg locations are augmented with a special in-hand location . Pegs (resp. disks) cannot occupy the same location at the same time, but 1 peg and 1 disk can occupy the same location at the same time.

  • Action. , which moves the e.e. to and opens/closes. It opens if a peg is located at and closes otherwise.

  • Transition. increments by 1. If no peg is at and a peg is at the action location, then the peg is grasped (). If a peg is located at and the action location does not contain a peg, the peg is placed (). Otherwise, the state remains unchanged.

  • Reward. 1 if a peg is placed on an unoccupied disk, -1 if a placed peg is removed, and 0 otherwise.

Initially, pegs and disks are at distinct locations, and no peg is in the e.e. The time horizon is , where there is enough time to grasp and place each object. This MDP satisfies the Markov property because the next state is completely determined from the current state and action. The number of possible states is shown in Eq. 1, and the number of actions is . It is not practical to learn the optimal policy by enumerating all state-action pairs for this MDP: for example, if and , the state-action value lookup table size is on the order of .


V-A2 Sense-Move-Effect MDP

We apply the sense-move-effect abstraction of Section IV-A and HSA constraints of Section IV-B to the tabular pegs on disks problem. The process is illustrated in Fig. 7. At level 1, the sensor perceives the entire grid as 8 cells, each cell summarizing the contents of an octant of space in the underlying grid. The robot then selects one of these cells to attend to. At levels , the sensor perceives 8 cells revealing more detail of the octant selected in the previous level. At level , the sensor perceives 8 cells in the underlying grid, and the location of the underlying action is determined by the cell selected here. Without loss of generality, assume the grid size of the ground MDP is a power of 2 and the number of levels is .

Fig. 7: HSA applied to the grid in Fig. 6. Columns correspond to levels 1, 2, and 3. The observed volume appears yellow, and the octant selected by the robot appears green. Top. Robot selects the peg and is holding it afterward. Bottom. Robot selects the disk.

  • State. The current level , the overt time step , a bit indicating if a peg is held, and the tuple where each . indicates the presence of unplaced pegs, unoccupied disks, placed pegs, and empty space.

  • Action. The action is , a location in the observed grids.

  • Transition. For levels , the robot selects a cell in which corresponds to some partition of space in the underlying grid. The sensor perceives this part of the underlying grid and generates the observation at level . For level , the selections determine the location of the underlying action, is reset to 1, and otherwise the transition is the same as in the ground MDP.

  • Reward. The reward is 0 for levels . Otherwise, the reward is the same as for the ground MDP.
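The level-wise abstract observation can be sketched as an octant summary of a binary occupancy grid (one grid per object type). The function below is an assumption consistent with the "presence" cells described above; its name is hypothetical.

```python
import numpy as np

# Summarize a binary occupancy grid as 8 cells, each flagging whether
# its octant contains anything. One such summary per object type.
def octant_summary(grid):
    n = grid.shape[0] // 2
    out = np.zeros((2, 2, 2), dtype=bool)
    for i in range(2):
        for j in range(2):
            for k in range(2):
                out[i, j, k] = grid[i*n:(i+1)*n,
                                    j*n:(j+1)*n,
                                    k*n:(k+1)*n].any()
    return out
```

At the next level, the same summary is applied to the selected octant of the underlying grid, revealing more detail at each step.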

The above process is no longer Markov because a history of states and actions could be used to better predict the next state. For instance, for a sufficiently long random walk, the exact location of all pegs and disks could be determined from the history of observations, and the underlying grid could be reconstructed.

On the other hand, this abstraction results in substantial savings in terms of the number of states (Eq. 2) and actions (). The only nonconstant term (besides ) is logarithmic in . Referring to the earlier example with and , the state-action lookup table size is the order of .


V-A3 Theoretical Results

The sense-move-effect MDP with HSA constraints can be classified according to the state abstraction ordering defined in Li et al. [27]. In particular, we show -irrelevance, which is sufficient for the convergence of a number of RL algorithms, including -learning, to a policy optimal in the ground MDP.

Definition 5 (-irrelevance Abstraction [27]).

Given an MDP , any states , and an arbitrary but fixed weighting function , a -irrelevance abstraction is such that for any action , implies .

is a mapping from ground states to abstract states and defines the abstract MDP.³ (Footnote 3: Although the definition is for infinite-horizon problems (due to ), our finite-horizon problem readily converts to an infinite-horizon problem by adding an absorbing state that is reached after overt stages.) The weight is the probability the underlying state is , given its abstract state is observed. Any fixed policy, e.g. -greedy with fixed , induces a valid and satisfies the definition.

Theorem 1 (Convergence of Q-learning under -irrelevance [27]).

Assume that each state-action pair is visited infinitely often and the step-size parameters decay appropriately. Q-learning with abstraction converges to the optimal state-action value function in the ground MDP. Therefore, the resulting optimal abstract policy is also optimal in the ground MDP.

Because Li et al. do not consider action abstractions, we redefine the ground MDP to have the same actions as the sense-move-effect MDP. Additionally, to keep the ground MDP Markov, we add the current level , and the current point of focus , to the state. This does not essentially change the tabular pegs on disks domain but merely allows us to rigorously make the following connection.

Let states and actions of the ground MDP be denoted by and respectively. Similarly, let states and actions of the sense-move-effect MDP be denoted by and respectively. Let be the observation function.

Theorem 2 ( is -irrelevant).

The sense-move-effect abstraction, , is a -irrelevance abstraction.


can be computed from and . The reward after the current overt stage depends on , whether or not it is possible to select a peg/disk, and whether or not it is possible to avoid selecting a placed peg. These are known from and . Furthermore, whether or not a peg will be held after the current stage can be determined from and . Finally, due to and the fact that all pegs are initially unplaced, the sum of future rewards following an optimal policy from the current stage depends only on (a) whether or not a peg will be held after the current stage and (b) the amount of time left, . ∎

V-A4 Simulation Results

The proof of Theorem 2 suggests that the observability of pegs, disks, placed pegs, and empty space is important for learning the optimal policy. On the other hand, we empirically found no disadvantage to removing the (placed pegs) and (empty space) grids. However, it is important to distinguish unplaced pegs from placed pegs. Fig. 8 shows learning curves for an HSA agent with and grids versus an HSA agent with the same grids but showing pegs/disks regardless of whether or not they are placed/occupied.

Fig. 8: Number of objects placed for the HSA agent (blue) and the agent with a faulty sensor (red). Curves are first mean and over each episode in 30 realizations, then averaged over -episode segments for visualization. There are objects, and the grid size is .

Lookahead HSA and Deictic Image Mapping variants (Section IV-C and IV-D) result in an even smaller state-action space than standard HSA. In the tabular domain, this means faster convergence (Fig. 9). Although the deictic representation seems superior in these results, it has a serious drawback: the action-selection time scales linearly with because there is one action for each cell in the underlying grid. The lookahead variant captures the best of both worlds – small representation and fast execution. Thus, in the tabular domain, lookahead appears to be a satisfactory middle ground between the two approaches. However, for the more complex domains, where Q-function approximation is required, the constant time needed to generate the action images becomes more significant, and the advantage of lookahead in terms of episodes to train diminishes. The latter may be because the function approximator (a CNN) has built-in position equivariance, taking away some of the advantage from the lookahead representation.

Fig. 9: Number of objects placed for standard (blue), lookahead (red), and deictic (yellow) agents. Curves are mean (solid) and (shaded) over 30 realizations. The plot is in log scale so that the lookahead and deictic results are visible.

In each of these experiments, the learning algorithm was Sarsa [23], and actions were taken greedily w.r.t. the current Q-estimate. Optimistic initialization of action-values and random tie-breaking were relied on for exploration. Under these conditions, Q-learning produces the same learning curves as Sarsa.
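The setup just described can be sketched as tabular Sarsa with optimistic initial values and random tie-breaking. This is a minimal sketch: the `env` interface (`reset`, `step`, `actions`) is a hypothetical stand-in for the sense-move-effect MDP, and the hyperparameter values are illustrative.

```python
import random
from collections import defaultdict

# Tabular Sarsa with optimistic initialization and random tie-breaking.
# The env interface and hyperparameter values are assumptions.
def sarsa(env, episodes, alpha=0.1, gamma=1.0, q0=1.0):
    Q = defaultdict(lambda: q0)  # optimistic initial values drive exploration

    def greedy(s):
        acts = env.actions(s)
        best = max(Q[(s, a)] for a in acts)
        return random.choice([a for a in acts if Q[(s, a)] == best])

    for _ in range(episodes):
        s = env.reset()
        a = greedy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = None if done else greedy(s2)
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```
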

V-B Upright Pegs on Disks

In this domain, pegs and disks are modeled as tall and flat cylinders, respectively, where the cylinder axis is always vertical (Fig. 10, left). Unlike the tabular domain, object size and position are sampled from a continuous space. Grasp and place success are checked with a set of simple conditions appropriate for upright cylinders.⁴ (Footnote 4: Grasp conditions: the gripper is collision-free and the top-center of exactly 1 cylinder is in the gripper’s closing region. Place conditions: the entire cylinder is above an unoccupied disk and the cylinder bottom is at most 1 cm below or 2 cm above the disk surface.) The reward is 1 for grasping an unplaced peg, -1 for grasping a placed peg, 1 for placing a peg on an unoccupied disk, and 0 otherwise.

Observations consist of 1 or 2 images (, , ); the current HSA level, ; and the e.e. status, . Each HSA level selects position (Fig. 10, right). Gripper orientation is not critical for this problem.

Fig. 10: Left. Example upright pegs on disks scene. Right. Level 1, 2, and 3 images for grasping the orange peg. Red cross denotes the position selected by the robot and the blue rectangle denotes the allowed offset. cm for level 1 and cm for levels 2 and 3. cm for level 1, cm for level 2, and cm for level 3. Pixel values normalized and height selection not shown for improved visualization.

V-B1 Network Architecture and Algorithm

The Q-function consists of 6 convolutional neural networks (CNNs), 1 for each combination of level and e.e. status, with identical architectures (Table II). This architecture results in faster execution time compared with our previous version [2]. The loss is the squared difference between the predicted and the actual action-value target, averaged over a mini-batch. The action-value target is the reward received at the end of the current overt stage.⁵ (Footnote 5: With standard Monte Carlo (MC) and , the action-value target would be the sum of rewards received after the current time step [34]. Since, for this problem, no positively rewarding grasp precludes a positively rewarding place, ignoring rewards after the current overt stage is acceptable.) For CNN optimization, Adam [35] is used with a base learning rate of , weight decay of , and mini-batch size of 64.

layer kernel size stride output size
conv-1 2
conv-2 2
conv-3 2
conv-4 1

TABLE II: CNN architecture for the upright pegs on disks domain. Each layer besides conv-4 has a rectified linear unit (ReLU) activation.

V-B2 Simulation Results

We tested the standard HSA approach with 1, 2, and 3 levels. The number of actions (CNN outputs) were set to have the same precision in positioning of the e.e.: 1 level had outputs, 2 levels had and outputs, and 3 levels each had outputs. Note that with 1 level this is comparable to the DQN (i.e. no-hierarchy) approach (Section IV-D). Exploration was -greedy with set such that the probability of taking at least 1 random action in an overt stage was the same: 0.04 for 3 levels, 0.059 for 2 levels, and 0.115 for 1 level.
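The quoted per-level values are consistent with requiring the probability of at least one random action per overt stage, 1 - (1 - ε)^L for L levels, to be constant across hierarchies. A quick check (helper name hypothetical):

```python
# Per-level epsilon chosen so that 1 - (1 - eps)^L, the probability of
# at least one random action within an overt stage of L levels, is
# constant across hierarchies. Helper name is hypothetical.
def per_level_eps(p_stage, levels):
    return 1.0 - (1.0 - p_stage) ** (1.0 / levels)

# Recovers (approximately) the values quoted above:
# 1 level -> 0.115, 2 levels -> 0.059, 3 levels -> 0.040
```
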

Results are shown in Fig. 11. The 1 and 3 level cases perform similarly in the end, with faster learning in terms of episodes for the 1 level case. However, execution is slower for 1 level due to the large number of actions that need to be processed. We were not able to determine why 2 levels performs worse than 1 or 3 levels.

Fig. 11: HSA with a varying number of levels (blue, red, and yellow curves). Curves are the mean over 10 realizations, then averaged over episode segments.

We tested standard HSA and lookahead HSA with 3 levels. Lookahead HSA with 1 level would be the Deictic Image Mapping approach (Section IV-D), but computation of all images was prohibitively expensive. For this comparison, the number of actions was increased to per level (hence the improvement compared to Fig. 11). Results are shown in Fig. 12. As in the tabular domain, lookahead learns faster in terms of the number of episodes; however, the difference is less pronounced than before, and the execution time for standard HSA is less than half that of lookahead.

Fig. 12: Standard HSA (blue) versus lookahead HSA (red).

V-C Pegs on Disks

The pegs on disks domain is the same as the upright pegs on disks domain except the cylinder axis of a peg can be either horizontal or vertical, and a correct placement requires the peg axis to be vertical (Fig. 13). (Grasp conditions: gripper collision-free, exactly 1 peg intersects the gripper closing region, the peg axis intersects the gripper closing region, and the fingers contact sides or ends but not 1 of each. Place conditions: peg is vertical, entirely over an unoccupied disk, bottom within cm of the disk surface, and gripper not in collision.) This seemingly innocuous change has important consequences for problem complexity. First, the orientation of the gripper matters for both grasping and placing. Second, it is possible to grasp the peg by its ends, thereby preventing the possibility of a successful place in the next stage. Thus, the robot needs to reason longer-term about the consequences of its actions.

We examine 2 reward functions: the first is the same as with upright pegs on disks, and the second is the same except a stable grasp with the fingers on the cylinder ends receives a reward of 0. The latter is called the “shaped grasp reward function” and requires less long-term reasoning, because grasps that cannot result in successful placements are penalized by the reward received at the end of the current overt stage.

Fig. 13: Example pegs on disks scene, where the 1st peg is correctly grasped and placed.

Observations are similar to before except the image resolution is higher, and the current overt time step is included whenever the grasp reward is unshaped. HSA has 3 levels selecting position and 1 level selecting orientation about the gripper approach axis (Fig. 14).

Fig. 14: HSA for grasping the peg in Fig. 13. cm for level 1 and cm for levels 2-4. cm for level 1, cm for level 2, and cm for level 3. Orientation for grasp level 4 varies in about the gripper approach axis and for place level 4 about the finger-closing axis. Pixel values normalized and height selection not shown for improved visualization.

V-C1 Network Architecture and Algorithm

The network architecture is described in Table III. Weight decay was changed to 0. When shaping the grasp reward, Q-estimate targets are the sum of rewards for the current stage. When not shaping the grasp reward, Q-estimate targets are computed via standard Monte Carlo (MC) [34], except the discount is accumulated only after the end of an overt stage (instead of after every time step).
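The modified Monte Carlo return can be sketched as a backward pass over the episode, applying the discount only at overt-stage boundaries. Names here are illustrative (not from the authors' implementation), assuming rewards and overt-stage indices are logged per time step:

```python
# Illustrative sketch: Monte Carlo returns where gamma is applied once per
# overt-stage boundary instead of once per (gaze-level) time step.
def overt_stage_returns(rewards, stage_ids, gamma):
    """rewards[t] and stage_ids[t] are the reward and overt-stage index at step t."""
    n = len(rewards)
    returns = [0.0] * n
    g = 0.0
    for t in range(n - 1, -1, -1):
        # Discount accumulated only when crossing into the next overt stage.
        if t < n - 1 and stage_ids[t + 1] != stage_ids[t]:
            g *= gamma
        g = rewards[t] + g  # rewards within a stage are undiscounted
        returns[t] = g
    return returns
```

For example, with two 3-step overt stages each ending in reward 1 and gamma = 0.5, every step of the first stage gets return 1 + 0.5 × 1 = 1.5 and every step of the second gets 1.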

layer kernel size stride output size
conv-1 2
conv-2 2
conv-3 2
conv-4 / ip-1 / - 1 / -
TABLE III: CNN architecture for the pegs on disks domain. Each layer besides the last has a ReLU activation. The last layer is a convolution layer for levels 1-3 (selecting position) and an inner product (IP) layer for level 4 (selecting orientation). for grasp networks and for place networks.

V-C2 Simulation Results

The first question is whether the robot can learn to perform long-term reasoning as well as it can learn to perform short-term reasoning. Tests are run with and without shaping the grasp reward (Fig. 15). For the unshaped case, we tested 7 different discount factors using the same random seed and chose the best one. With reward shaping, learning converges faster and ultimately a better policy is learned.

Fig. 15: Average number of objects placed when (blue) shaping the grasp reward and (red) not shaping grasp reward. Curves are mean over 10 realizations then averaged over episode segments.

The second question is how long-term reasoning is affected by the choice of RL algorithm. Here we test Monte Carlo [34], Sarsa [23], and Q-learning [25] (Fig. 16). (Specifically, 4-step Sarsa and Q-learning were used [34], which performed better than the 1-step versions in preliminary tests.) The discount factor is increased to emphasize the difference between the methods. Q-learning and Sarsa do indeed learn faster in terms of episodes, but the final improvement is marginal. Note that MC executes faster, since the Q-function is not consulted when generating action-value targets.

Fig. 16: Pegs on disks comparison with MC (blue), Sarsa (red), and Q-learning (yellow).

V-D Bottles on Coasters

The main question addressed here is whether HSA can be applied to a practical problem and implemented on a physical robotic system. The bottles on coasters domain is similar to the pegs on disks domain, but now objects have complex shapes and are required to be placed upright. (Grasp conditions: the gripper closing region intersects exactly 1 object, and the antipodal condition from [11] holds with a friction cone. Place conditions: bottle is upright, center of mass (CoM) position at least 2 cm inside an unoccupied coaster, and bottom within cm of the coaster surface.) The reward is for grasping an unplaced object more than 4 cm from the bottom (bottom grasps are kinematically infeasible in the physical system), for grasping a placed object, for placing a bottle, and 0 otherwise.

Observations are the same as in the pegs on disks domain except the overt time step is always input to grasp networks (and never input to place networks). The 4-level hierarchy is illustrated in Fig. 5.

V-D1 Network Architecture and Algorithm

The network architecture is the same as in Table III except the number of kernels for conv-1 and conv-3 were increased to 64. Weight decay is again 0. Q-network targets are the reward after the current overt stage.

V-D2 Simulation Results

70 bottles from 3DNet [36] were randomly scaled to height 10-20 cm. Bottles were placed upright with probability and on their sides with probability . Learning curves for 2 bottles and 2 coasters are shown in Fig. 17. Performance is somewhat worse than that of the pegs on disks domain, reflective of the additional problem complexity.

Fig. 17: Number of bottles grasped (blue) and placed (red). Curves are mean over 10 realizations then averaged over episode segments.

V-D3 Top- Sampling

Before considering experiments on a physical robotic system, we address an important assumption of the move-effect system of Section III: that the e.e. can move to any pose in the robot’s workspace. Recent advances in motion planning algorithms make this a reasonable assumption for the most part; nonetheless, a pose can still sometimes be unreachable due to obstacles, motion planning failure, or inverse kinematics (IK) failure.

To address this issue, multiple high-valued actions are sampled from the policy learned in simulation. In particular, for each level of an overt stage, we take the top- samples according to Eq. 3, where is the action-value estimate at level , is the maximum possible action-value, is the minimum possible action-value, and .


Preliminary tests in simulation showed sampling top- values performs better than sampling top- values, as was done previously [2].

At test time, the resulting samples are checked for an IK and motion-plan solution in descending order of value. As increases, the probability of failing to find a reachable e.e. pose decreases; however, the more poses that are unreachable, the lower the value. Thus, when designing an HSA system, it is important not to overconstrain the space of actions.
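The descending-value reachability check can be sketched as below. Here `is_reachable` stands in for the IK and motion-planning calls, which are assumptions of this sketch rather than an API from the paper:

```python
from typing import Callable, List, Sequence

def select_reachable_pose(
    candidate_poses: Sequence,               # sampled e.e. poses (top candidates)
    values: Sequence[float],                 # corresponding action-value estimates
    is_reachable: Callable[[object], bool],  # placeholder for IK + motion-plan check
):
    """Try candidate poses in descending order of estimated value and
    return the first one with an IK and motion-plan solution."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    for i in order:
        if is_reachable(candidate_poses[i]):
            return candidate_poses[i]
    raise RuntimeError("no reachable pose among the sampled candidates")
```

This makes the trade-off above concrete: more candidates lower the chance of total failure, but each fallback candidate carries a lower estimated value than the pose the policy preferred.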

V-D4 Robot Experiments

We tested the bottles on coasters task with the physical system depicted in Fig. 18. The system consists of a Universal Robots 5 (UR5) arm, a Robotiq 85 parallel-jaw gripper, and a Structure depth sensor. The test objects (Fig. 19) were not observed during training. The CNN weight files used had roughly average performance among the 10 realizations (Fig. 17).

Fig. 18: Test setup for bottles on coasters task: a UR5 arm, Robotiq 85 gripper, Structure depth sensor (mounted out of view above the table and looking down), 2 bottles, and 2 coasters.

Initially, 2 coasters were randomly selected and placed in arbitrary positions in the back half of the robot’s workspace (placements too close to the robot resulted in unreachable places). Then, 2 bottles were randomly selected and placed upright with probability and on the side with probability . The bottles were not allowed to be placed over a coaster. (Python’s pseudorandom number generator was used to decide the objects used and upright/side placement; object position was decided by a human instructed to make the scenes diverse.) Top- sampling with was used. A force threshold was set for the final grasp/place approach: if the magnitude of the force on the arm exceeded this threshold, the motion was canceled and the open/close action was immediately performed.

Fig. 19: Test objects used in UR5 experiments.

Results are summarized in Table IV, and a successful sequence is depicted in Fig. 20. A grasp was considered successful if a bottle was lifted to the “home” configuration; a place was considered successful if a bottle was placed upright on an unoccupied coaster and remained there after the gripper withdrew. Failures were: grasping a placed object, placing too close to the edge of a coaster so the bottle fell over, placing upside-down, object slip in the hand after a grasp causing a place failure, and the object falling out of the hand after a grasp.

Fig. 20: Successful trial – all bottles placed in 4 overt stages. Image taken immediately after open/close.
Grasp Place
Attempts 60 59
Success Rate 0.98 0.90
Number of Objects 1.97 1.67
TABLE IV: Performance for UR5 experiments placing 2 bottles on 2 coasters averaged over 30 episodes. Task success rate with was 0.67.

V-E 6-DoF Pick-Place

The HSA method was also implemented for 6-DoF manipulation, and the same system was tested on 3 different pick-place tasks [2]. (This section refers to an earlier version of our system, so the simulations took longer and the success rates for bottles are lower.) The setup was similar to that in Fig. 18 except the sensor was mounted to the wrist; see [2] for more details. The tasks included stacking a block on top of another, placing a mug upright on the table, and (similar to Section V-D) placing a bottle on a coaster. All tasks included novel objects in light to moderate clutter (Fig. 21). To handle perceptual ambiguities in mugs, the observations were 3-channel images projected from a point cloud obtained from 2 camera poses. HSA included 6 levels: 3 for position and 1 for each Euler angle. Results from UR5 experiments are shown in Table V.

Fig. 21: 6-DoF pick place on the UR5 system. Top. Blocks task. Bottom. Mugs task. Notice the grasp is diagonal to the mug axis, and the robot compensates for this by placing diagonally with respect to the table surface.
Blocks Mugs Bottles
Grasp 0.96 0.86 0.89
Place 0.67 0.89 0.64
Task 0.64 0.76 0.57
Grasps 50 51 53
Places 48 44 47
TABLE V: Top. Grasp, place, and task success rates for the 3 tasks with (i.e., 1 pick 1 place). Bottom. Number of grasp and place attempts.

VI Conclusion

The primary conclusion is that the sense-move-effect abstraction, when coupled with hierarchical spatial attention, is an effective way of simultaneously handling (a) high-resolution 3D observations and (b) high-dimensional, continuous action spaces. These two issues are intrinsic to realistic problems of robot learning. We provide several other considerations relevant to systems employing spatial attention:

VI-A Secondary Conclusions

  • Compared to a flat representation, HSA can result in an exponential reduction in the number of actions that need to be sampled (Section IV-B).

  • HSA generalizes DQN, and lookahead HSA generalizes Deictic Image Mapping (Section IV-D).

  • The partial observability induced by an HSA observation does not preclude learning an optimal policy (Section V-A).

  • HSA may take longer to learn than DQN in terms of the number of episodes to convergence, but HSA executes faster when the number of actions is large (Section V-B).

  • Lookahead HSA is preferred to standard HSA in terms of the number of episodes to train, but its execution time is longer by a constant factor, and the learning benefit diminishes when coupled with function approximation (Sections V-A and V-B).

  • Reward shaping can improve learning performance when there are long-term consequences of actions (Section V-C).

  • HSA can be applied to realistic problems on a physical robotic system (Sections V-D and V-E).
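The action-count reduction in the first bullet can be illustrated with a toy calculation (hypothetical numbers, not from the paper's experiments): a flat discretization of 3D position evaluates the Q-function at every cell, while a hierarchy with branching factor b per axis and L levels reaches the same resolution (b^L cells per axis) with only L·b³ evaluations, i.e., logarithmic rather than linear growth in the number of cells.

```python
# Toy illustration of flat vs. hierarchical action counts for 3D position.
def flat_action_count(n_per_axis: int) -> int:
    """Q evaluations for a flat discretization with n_per_axis cells per axis."""
    return n_per_axis ** 3

def hsa_action_count(branch_per_axis: int, levels: int) -> int:
    """Q evaluations for a hierarchy reaching branch_per_axis**levels
    cells per axis: each level evaluates branch_per_axis**3 actions."""
    return levels * branch_per_axis ** 3
```

For instance, 64 cells per axis costs 64³ = 262,144 flat evaluations, whereas a 3-level hierarchy with branching factor 4 per axis (4³ = 64 cells per axis) costs only 3 × 4³ = 192.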

VI-B Limitations and Future Work

With our approach, the system designer still needs to select HSA parameters and possibly a reward function. In principle, these could also be learned by the robot, but this would likely require more training. Another concern is that, even for some of the simpler problems, an optimal policy was not learned in simulation. To some extent this may be remedied by top- sampling. (Notice the real-robot results in Section V-D are better than the simulation results.) However, we would like to understand and address this problem more rigorously. Finally, how manipulation skills can be automatically and efficiently transferred to different but related tasks remains an interesting, open question. Even small changes to the task, such as including distractor objects, require complete retraining of the system for maximum performance.


We thank Andreas ten Pas for reviewing an early draft of this paper and the anonymous reviewers of [2] for their insightful comments.


  • [1] S. Whitehead and D. Ballard, “Learning to perceive and act by trial and error,” Machine Learning, vol. 7, no. 1, pp. 45–83, 1991.
  • [2] M. Gualtieri and R. Platt, “Learning 6-DoF grasping and pick-place using attention focus,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87, Oct 2018, pp. 477–486.
  • [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [4] R. Platt, C. Kohler, and M. Gualtieri, “Deictic image maps: An abstraction for learning pose invariant manipulation policies,” in AAAI Conf. on Artificial Intelligence, 2019.
  • [5] T. Lozano-Pérez, “Motion planning and the design of orienting devices for vibratory part feeders,” IEEE Journal Of Robotics And Automation, 1986.
  • [6] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield, “Deep object pose estimation for semantic robotic grasping of household objects,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87.   PMLR, Oct 2018, pp. 306–316.
  • [7] I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The Int’l Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
  • [8] L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in IEEE Int’l Conf. on Robotics and Automation, 2016, pp. 3406–3413.
  • [9] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with large-scale data collection,” in Int’l Symp. on Experimental Robotics.   Springer, 2016, pp. 173–184.
  • [10] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” Robotics: Science and Systems, vol. 13, 2017.
  • [11] A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt, “Grasp pose detection in point clouds,” The Int’l Journal of Robotics Research, vol. 36, no. 13-14, pp. 1455–1473, 2017.
  • [12] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “Scalable deep reinforcement learning for vision-based robotic manipulation,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87.   PMLR, 29–31 Oct 2018, pp. 651–673.
  • [13] D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” 2018.
  • [14] D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine, “Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods,” in IEEE Int’l Conf. on Robotics and Automation, 2018.
  • [15] A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo et al., “Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,” in IEEE Int’l Conf. on Robotics and Automation, 2018.
  • [16] Y. Jiang, C. Zheng, M. Lim, and A. Saxena, “Learning to place new objects,” in Int’l Conf. on Robotics and Automation, 2012, pp. 3088–3095.
  • [17] M. Gualtieri, A. ten Pas, and R. Platt, “Pick and place without geometric object models,” in IEEE Int’l Conf. on Robotics and Automation, 2018.
  • [18] A. Xie, A. Singh, S. Levine, and C. Finn, “Few-shot goal inference for visuomotor learning and planning,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87.   PMLR, 29–31 Oct 2018, pp. 40–52.
  • [19] S. James, A. Davison, and E. Johns, “Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task,” in Conf. on Robot Learning, vol. 78.   Proceedings of Machine Learning Research, 2017, pp. 334–343.
  • [20] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” in Advances in Neural Information Processing Systems, 2017, pp. 5048–5058.
  • [21] J. Kober, A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
  • [22] C. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, University of Cambridge, 1989.
  • [23] G. Rummery and M. Niranjan, “On-line Q-learning using connectionist systems,” Cambridge University Engineering Department, CUED/F-INFENG/TR 166, September 1994.
  • [24] T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [25] C. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • [26] T. Jaakkola, M. Jordan, and S. Singh, “Convergence of stochastic iterative dynamic programming algorithms,” in Advances in neural information processing systems, 1994, pp. 703–710.
  • [27] L. Li, T. Walsh, and M. Littman, “Towards a unified theory of state abstraction for MDPs,” in Int’l Symp. on Artificial Intelligence and Mathematics, 2006.
  • [28] N. Sprague and D. Ballard, “Eye movements for reward maximization,” in Advances in neural information processing systems, 2004, pp. 1467–1474.
  • [29] H. Larochelle and G. Hinton, “Learning to combine foveal glimpses with a third-order Boltzmann machine,” in Advances in neural information processing systems, 2010, pp. 1243–1251.
  • [30] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in Advances in neural information processing systems, 2014, pp. 2204–2212.
  • [31] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
  • [32] M. Gualtieri and R. Platt, “Viewpoint selection for grasp detection,” in IEEE/RSJ Int’l Conf. on Intelligent Robots and Systems, 2017, pp. 258–264.
  • [33] D. Morrison, P. Corke, and J. Leitner, “Multi-view picking: Next-best-view reaching for improved grasping in clutter,” in IEEE Int’l Conf. on Robotics and Automation, 2019.
  • [34] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, 2nd ed.   MIT Press Cambridge, 2018.
  • [35] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Int’l Conf. on Learning Representations, Dec 2014.
  • [36] W. Wohlkinger, A. Aldoma, R. Rusu, and M. Vincze, “3DNet: Large-scale object class recognition from CAD models,” in IEEE Int’l Conf. on Robotics and Automation, 2012, pp. 5384–5391.