Learning robotic manipulation has remained an active and challenging research area. This is because the real-world environments in which robots exist are large, dynamic, and complex. Partial observability – where the robot does not perceive the entire environment all at once – is common and requires reasoning over past perceptions. Additionally, the ability to generalize to new situations is critical because, in the real world, new objects can appear in different places unexpectedly.
The particular problem addressed in this paper is the large space of possible robot observations and actions – how the robot processes its past and current perceptions to make high-dimensional decisions. Visual attention has long been suggested as a solution to this problem. Focused perceptions can ignore irrelevant details, and generalization is improved by the elimination of the many irrelevant combinations of object arrangements. Additionally, as we later show, attention can result in a substantial reduction in the number of actions that need to be considered. Indeed, when selecting position, the number of action choices can become logarithmic rather than linear in the volume of the robot’s workspace. In spite of these benefits, visual attention has largely not caught on due to (a) the additional burden of learning where to attend and (b) the additional partial observability caused by the narrowed focus.
We address the first challenge – efficiently learning where to attend – by constraining the system to a spatial hierarchy of attention. On a high level this means the robot must first see a large part of the scene in low detail, select a position within that observation, and see the next observation in more detail at the position previously selected, and so on for a fixed number of gazes. We address the second challenge – partial observability induced by the narrowed focus – by identifying attention with a type of state-abstraction which preserves the ability to learn optimal policies with efficient reinforcement learning (RL) algorithms.
This article extends our prior work , wherein we introduced the hierarchical spatial attention (HSA) approach and demonstrated it on 3 challenging, 6-DoF, pick-place tasks. New additions include (a) faster training and inference times, (b) more ablation studies and comparisons to related work, (c) better understanding of when an optimal policy can be learned when using this approach, (d) longer time horizons, and (e) improved real-robot experimental results.
The rest of the paper is organized as follows. First is related work (Section II). Next, the general manipulation problem is described and the visual attention aspect is added (Sections III and IV-A). After that, the HSA constraints are added, and this approach is viewed as a generalization of earlier approaches (Section IV-B to IV-D). The bulk of the paper includes analysis and comparisons in 5 domains of increasing complexity (Section V). Real robot experiments are described close to the end (Sections V-D and V-E). Finally, we conclude with what we learned and future directions (Section VI).
II Related Work
This work is most related to robotic manipulation, reinforcement learning, and attention models. It extends our prior research on 6-DoF pick-place and primarily builds on DQN and Deictic Image Mapping.
II-A Learning Robotic Manipulation
While these systems can be quite robust in controlled environments, they encounter failures when the shapes of the objects differ from those expected. Recent work has demonstrated grasping of novel objects by employing techniques intended to address the problem of generalization in machine learning [7, 8, 9, 10, 11, 12, 13, 14, 15].
There have been attempts to extend novel object grasping to more complex tasks such as pick-place. However, these have assumed either fixed grasp choices  or fixed place choices . The objective of the present work is to generalize these attempts – a single system that can find 6-DoF grasp and place poses.
Research contemporary with ours considers grasping and pushing novel objects to a target location. Their approach is quite different: a predictive model of the environment is learned and used for planning, whereas we aim to learn a policy directly. Other work has considered the problem of domain transfer and sparse rewards in RL. We view these as complementary ideas that could be combined with our approach.
II-B Reinforcement Learning
Like several others, we apply RL techniques to the problem of robotic manipulation (see the above-mentioned [9, 12, 14, 17, 20] and survey). RL is appealing for robotic control for several reasons. First, several algorithms (e.g., [22, 23]) do not require a complete model of the environment. This is of particular relevance to robotics, where the environment is dynamic and difficult to describe exactly. Additionally, observations are often encoded as camera or depth sensor images. Deep Q-Networks (DQN) demonstrated an agent learning difficult tasks (Atari games) where observations were image sequences and actions were discrete. An alternative to DQN that can handle continuous action spaces is the family of actor-critic methods like DDPG. Finally, RL – which has its roots in optimal control – provides tools for the analysis of learning optimal behavior (e.g., [25, 26, 27]), which we refer to in Section V-A.
II-C Attention Models
Our approach is inspired by models of visual attention. Following the early work of Whitehead and Ballard, we distinguish overt actions (which directly effect change in the environment) from perceptual actions (which retrieve information). Similar to their agent model, our abstract robot has a virtual sensor which can be used to focus attention on task-relevant parts of the scene. The present work updates their methodology to address more realistic problems, and we extend their analysis by describing a situation where an optimal policy can be learned even in the presence of “perceptual aliasing” (i.e., partial observability).
Attention mechanisms have also been used with artificial neural networks to identify an object of interest in a 2D image [28, 29, 30, 31]. Our situation is more complex in that we identify 6-DoF poses of the robot’s hand. Improved grasp performance has been observed by active control of the robot’s sensor [32, 33]. These methods attempt to identify the best sensor placement for grasp success. In contrast, our robot learns to control a virtual sensor for the purpose of reducing the complexity of action selection and learning.
III Problem Statement
Definition 1 (Move-Effect System).
A move-effect system is a discrete time system consisting of a robot, equipped with a depth sensor and end effector (e.e.), and rigid objects of various shapes and configurations. The robot perceives a history of point clouds , where is acquired by the depth sensor; an e.e. status, ; and a reward . The robot’s action is , where is the pose of the e.e., is the robot’s workspace, and is a preprogrammed controller for the e.e. For each stage , the robot receives a new perception and takes an action.
The reward is usually instrumented by the system engineer to indicate progress toward completion of some desired task. The robot initially has no knowledge of the system’s state transition dynamics. The objective is, by experiencing a sequence of episodes, for the robot to learn a policy – a mapping from observations to actions – which maximizes the expected sum of per-episode rewards.
For example, suppose the e.e. is a 2-fingered gripper, , , the objects are bottles and coasters, and the task is to place all the bottles on the coasters. The reward could be for placing a bottle on a coaster, for removing a placed bottle, and otherwise.
Our approach has two parts. The first part is to reformulate the problem as a Markov decision process (MDP) with abstract states and actions (Section IV-A). With this reformulation, the resulting state representation is substantially reduced, and it becomes possible for the robot to learn to restrict attention to task-relevant parts of the scene. The second part is to add constraints to the actions such that the e.e. pose is decided sequentially (Section IV-B). After these improvements, the problem is then amenable to solution via standard RL algorithms like DQN (examples in Section V).
IV-A Sense-Move-Effect MDP
The sense-move-effect system adds a controllable, virtual sensor which perceives a portion of the point cloud from a parameterizable perspective (Fig. 2).
Definition 2 (Sense-Move-Effect System).
A sense-move-effect system is a move-effect system where the robot’s actions are augmented with (where and ) and the point cloud observations are replaced with a history of images, (where ). The action has the effect of adding to the history.¹ is orthographic projections of points onto , images. Each image plane is positioned at the origin with a different orientation. Image values are the point-to-plane distance, ambiguities resolved with the nearest distance. returns the points of which lie inside a rectangular volume situated at the origin with length, width, height . expresses (initially expressed w.r.t. the world frame) w.r.t. .
The action makes it possible for the robot to get either a compact overview of the scene or to attend to a small part of the scene in detail. Since the resolution of the images is fixed, large values of correspond to seeing more objects in less detail, and small values of correspond to seeing fewer objects in more detail.
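To make the zoom-controlled virtual sensor concrete, below is a minimal Python sketch of a crop-and-project operation. The function name, the single top-down projection, and the 4x4 grid resolution are illustrative assumptions; the paper's sensor projects onto several image planes with different orientations.

```python
def sense(points, pose_xyz, zoom, res=4):
    """Hypothetical virtual sensor: crop points inside a cube of side
    `zoom` centered at `pose_xyz`, then orthographically project them
    onto a top-down res x res height map (cell value = greatest height,
    i.e., the point nearest an overhead camera)."""
    cx, cy, cz = pose_xyz
    half = zoom / 2.0
    img = [[0.0] * res for _ in range(res)]
    for (x, y, z) in points:
        if abs(x - cx) <= half and abs(y - cy) <= half and abs(z - cz) <= half:
            # pixel coordinates of the point within the cropped volume
            u = min(res - 1, int((x - cx + half) / zoom * res))
            v = min(res - 1, int((y - cy + half) / zoom * res))
            img[u][v] = max(img[u][v], z - (cz - half))
    return img

# Large zoom: both objects visible coarsely; small zoom: one object in detail.
cloud = [(0.1, 0.1, 0.05), (0.8, 0.8, 0.10)]
wide = sense(cloud, (0.5, 0.5, 0.1), zoom=1.0)
narrow = sense(cloud, (0.1, 0.1, 0.1), zoom=0.2)
```

Because the image resolution is fixed, the same number of pixels covers 1 m in `wide` but only 20 cm in `narrow`, mirroring the overview-versus-detail trade-off described above.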
The robot’s memory need not include the last images – it can include any previous images selected according to a predetermined strategy. Because the environment only changes after actions, we keep the latest image, , and the last images that appeared just before actions. Fig. 3 shows an example in the bottles on coasters domain.
In order to apply standard RL algorithms to the problem of learning to control a sense-move-effect system, we define the sense-move-effect MDP.
Definition 3 (Sense-Move-Effect MDP).
Given a sense-move-effect system, a reward function, and transition dynamics, a sense-move-effect MDP is a finite horizon MDP where states are sense-move-effect system observations and actions are sense-move-effect system actions.
The reward function and transition details are task and domain specific, respectively, examples of which are given in Section V.
IV-B Hierarchical Spatial Attention
The observation is now similar to that of DQN – a short history of images plus the e.e. status – and can be used by a Q-network to approximate Q-values. However, the action space remains large due to the 6-DoF choice for or and the 3-DoF choice for . Additionally, it may take a long time for the robot to learn which actions result in useful observations. To remedy both issues, we design constraints on the sense-move-effect actions.
Definition 4 (Hierarchical Spatial Attention).
Given a sense-move-effect system, , , and the list of pairs , (where and ), hierarchical spatial attention (HSA) constrains the robot to take actions, with for , prior to each action. Furthermore, the first sensor pose in this sequence must be ; the sensor poses , for , must be offset no more than from ; and the e.e. pose must be offset no more than of .² Concretely, indicates a position offset of , , and and a rotation offset of , , and .
The process is thus divided into overt stages, where, for each stage, actions are followed by 1 action (Fig. 4). The constraints should be set such that the observation size and offset decrease as increases, so the point cloud under observation decreases in size, and the volume within which the e.e. pose can be selected also decreases. These constraints are called hierarchical spatial attention because the robot is forced to learn to attend to a small part of the scene (e.g., Fig. 5).
To see how HSA can improve action sample efficiency, consider the problem of selecting position in a 3D volume. Let be the largest volume allowed per sample. With naive sampling, the required number of samples is proportional to the workspace volume , i.e., . But with HSA, we select position sequentially, by say, halving the volume size in each direction at each step, i.e., . In this case samples are needed – 1 for each part of the divided space at each step, and the volume under focus is . To get , , so the sample complexity becomes logarithmic, rather than linear, in .
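The halving argument can be checked numerically. This sketch assumes a branching factor of 8 (halving each of the 3 axes per sense step) and counts the candidates evaluated; the function names are ours:

```python
import math

def hsa_queries(num_cells, branch=8):
    """Candidate positions evaluated under hierarchical refinement:
    `branch` children per sense step (2 per axis in 3D), descending
    until a single cell of the target volume remains."""
    steps = math.ceil(math.log(num_cells, branch))
    return steps * branch

def naive_queries(num_cells):
    """Uniform sampling: one candidate per resolvable cell."""
    return num_cells

# 1 m workspace at 1 mm resolution -> 1000^3 position cells.
cells = 1000 ** 3
```

Here naive sampling evaluates 10^9 candidates, while hierarchical refinement evaluates 8 candidates at each of 10 steps, i.e., 80 in total: logarithmic rather than linear in the number of cells.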
IV-C Lookahead Sense-Move-Effect
So far we have not specified how the action parameters , , and are encoded. For standard sense-move-effect, these are typically encoded as 6 floating point numbers representing the pose and 3 floating point numbers representing the volume size . Alternatively, the pair could be encoded as the sense image that would be seen if the sensor were to move to pose with zoom . This is as if the action were “looking ahead” at the pose the sensor or e.e. would move to if this action were selected.
In particular, the lookahead sense-move-effect MDP has actions and , the difference being the additional parameter for . The action samples are encoded as the height map that would be generated by . Because action has this rich encoding, state is just the e.e. status and a history of actions.
The HSA constraints for the lookahead variant have the same parameterization – an initial pose and a list of pairs . The semantics are slightly different. for is the parameter for the th , and is the parameter. The for specify the offset of the action samples relative to the last pose decided, . specifies the offset of relative to .
IV-D Relation to Other Approaches in the Literature
Consider a sense-move-effect MDP with HSA constraints , centered in the robot’s workspace, and and large enough to capture the entire workspace. The only free action parameters for this system are the e.e. pose, which is sampled uniformly and spaced appropriately for the task, and the e.e. operation. In this case, the observations and actions are similar to that of DQN , and the DQN algorithm can be applied to the resulting MDP.
However, this approach is problematic in robotics because the required number of action samples is large, and the image resolution would need to be high in order to capture the required details of the scene. For example, a pick-place task where e.e. poses are in , the robot workspace is 1 m, the required position precision is 1 mm, and the required orientation resolution is 1° per Euler angle requires on the order of actions. Adding more levels (i.e., ) alleviates this problem.
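Under the stated resolutions, a back-of-the-envelope count of single-shot (flat) action candidates looks as follows; the numbers below are our own arithmetic for illustration, not a figure quoted from the text:

```python
# Flat (no-hierarchy) discretization of a 6-DoF pose space.
positions = 1000 ** 3    # 1 m workspace at 1 mm precision, per axis
orientations = 360 ** 3  # 1 degree resolution per Euler angle
flat_actions = positions * orientations  # ~4.7e16 single-step actions
```

Evaluating a Q-value for each of ~10^16 actions at every step is clearly impractical, which motivates the hierarchical refinement above.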
IV-D2 Deictic Image Mapping
With , centered in the robot’s workspace, the deictic marker size (e.g., the size of the largest object to be manipulated), and large enough to capture the entire workspace, HSA applied to the lookahead sense-move-effect MDP is the Deictic Image Mapping representation . Similar to the case with DQN, if the space of e.e. poses is large, and precise positioning is needed, many actions need to be sampled. In fact, the computational burden with the Deictic Image Mapping representation is even larger than that of DQN due to the need to create images for each action. Yet, the deictic representation has significant advantages over DQN in terms of efficient learning due to its small observations .
HSA generalizes and improves upon both DQN and Deictic Image Mapping by overcoming the burden for the agent to select from many actions in a single time step. Instead, the agent sequentially refines its choice of e.e. pose over a sequence of decisions. We provide comparisons between these approaches in the following sections.
V Application Domains
In this section we compare the HSA approach in 5 application domains of increasing complexity. The complexity increases in terms of the size of the action space and in terms of the diversity of object poses and geometries. We analyze simpler domains because the results are more interpretable and learning is faster (Table I). More complex domains are included to demonstrate the practicality of the approach. All training is in simulation, and Sections V-D and V-E include test results for a physical robotic system.
Table I (domains compared): Tabular Pegs on Disks, Upright Pegs on Disks, Pegs on Disks, Bottles on Coasters, 6-DoF Pick-Place.
V-A Tabular Pegs on Disks
Here we analyze the HSA approach applied to a simple, tabular domain, where the number of states and actions is finite. The domain consists of 2 types of objects – pegs and disks – which are situated on a 3D grid (Fig. 6). The robot can move its e.e. to a location on the grid and open/close its gripper. The goal is for the robot to place all the pegs onto disks.
If this problem is described as a finite MDP, eventual convergence to the optimal policy is guaranteed for standard RL algorithms [25, 26]. However, the number of state-action pairs is too large for practical implementation unless some abstraction is applied. The main question addressed here is whether convergence guarantees are maintained with the HSA abstraction.
V-A1 Ground MDP
Tabular pegs on disks is first described without the sense-move-effect abstraction.
State. A set of pegs , a set of disks , and the current time . A peg (resp. disk) is a location except peg locations are augmented with a special in-hand location . Pegs (resp. disks) cannot occupy the same location at the same time, but 1 peg and 1 disk can occupy the same location at the same time.
Action. , which moves the e.e. to and opens/closes. It opens if a peg is located at and closes otherwise.
Transition. increments by 1. If no peg is at and a peg is at the action location, then the peg is grasped (). If a peg is located at and the action location does not contain a peg, the peg is placed (). Otherwise, the state remains unchanged.
Reward. 1 if a peg is placed on an unoccupied disk, -1 if a placed peg is removed, and 0 otherwise.
Initially, pegs and disks are at distinct locations, and no peg is in the e.e. The time horizon is , where there is enough time to grasp and place each object. This MDP satisfies the Markov property because the next state is completely determined from the current state and action. The number of possible states is shown in Eq. 1, and the number of actions is . It is not practical to learn the optimal policy by enumerating all state-action pairs for this MDP: for example, if and , the state-action value lookup table size is on the order of .
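The ground MDP's transition and reward rules can be sketched compactly. The state encoding below (location sets plus a held flag) and the function name are illustrative assumptions, not the paper's implementation:

```python
def step(state, loc):
    """One ground-MDP transition sketch: move the e.e. to grid location
    `loc` and open/close. `state` = (pegs, disks, held), with pegs and
    disks as frozensets of grid locations."""
    pegs, disks, held = state
    reward = 0
    if not held and loc in pegs:        # e.e. empty, peg here: grasp
        if loc in disks:
            reward = -1                 # removed a placed peg
        pegs, held = pegs - {loc}, True
    elif held and loc not in pegs:      # e.e. full, cell peg-free: place
        if loc in disks:
            reward = +1                 # placed on an unoccupied disk
        pegs, held = pegs | {loc}, False
    return (pegs, disks, held), reward

s = (frozenset({(0, 0, 0)}), frozenset({(1, 0, 0)}), False)
s, r1 = step(s, (0, 0, 0))  # grasp the peg
s, r2 = step(s, (1, 0, 0))  # place it on the disk
```

The grasp of an unplaced peg earns 0 here and the successful placement earns +1, matching the reward specification above.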
V-A2 Sense-Move-Effect MDP
We apply the sense-move-effect abstraction of Section IV-A and HSA constraints of Section IV-B to the tabular pegs on disks problem. The process is illustrated in Fig. 7. At level 1, the sensor perceives the entire grid as 8 cells, each cell summarizing the contents of an octant of space in the underlying grid. The robot then selects one of these cells to attend to. At levels , the sensor perceives 8 cells revealing more detail of the octant selected in the previous level. At level , the sensor perceives 8 cells in the underlying grid, and the location of the underlying action is determined by the cell selected here. Without loss of generality, assume the grid size of the ground MDP is a power of 2 and the number of levels is .
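The octant-descent structure of the hierarchy can be sketched as follows, assuming (as above) a power-of-2 grid: each level commits to one of 8 octants until a single underlying cell remains. Names and the bit-triple encoding are illustrative:

```python
def descend(grid_size, octants):
    """Resolve a single cell of an n x n x n grid (n a power of 2) by
    choosing one of 8 octants per level; `octants` holds one (i, j, k)
    bit-triple per level."""
    x = y = z = 0
    half = grid_size
    for (i, j, k) in octants:
        half //= 2
        x, y, z = x + i * half, y + j * half, z + k * half
    return (x, y, z)

# An 8 x 8 x 8 grid needs log2(8) = 3 levels to pin down one cell.
cell = descend(8, [(1, 0, 0), (0, 1, 0), (1, 1, 1)])
```

Each level only ever presents 8 choices, which is the source of the logarithmic action complexity discussed in Section IV-B.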
State. The current level , the overt time step , a bit indicating if a peg is held, and the tuple where each . indicates the presence of unplaced pegs, unoccupied disks, placed pegs, and empty space.
Action. The action is , a location in the observed grids.
Transition. For levels , the robot selects a cell in which corresponds to some partition of space in the underlying grid. The sensor perceives this part of the underlying grid and generates the observation at level . For level , the selections determine the location of the underlying action, is reset to 1, and otherwise the transition is the same as in the ground MDP.
Reward. The reward is 0 for levels . Otherwise, the reward is the same as for the ground MDP.
The above process is no longer Markov because a history of states and actions could be used to better predict the next state. For instance, for a sufficiently long random walk, the exact location of all pegs and disks could be determined from the history of observations, and the underlying grid could be reconstructed.
On the other hand, this abstraction results in substantial savings in terms of the number of states (Eq. 2) and actions (). The only nonconstant term (besides ) is logarithmic in . Referring to the earlier example with and , the state-action lookup table size is the order of .
V-A3 Theoretical Results
The sense-move-effect MDP with HSA constraints can be classified according to the state abstraction ordering defined by Li et al. In particular, we show -irrelevance, which is sufficient for the convergence of a number of RL algorithms, including -learning, to a policy that is optimal in the ground MDP.
Definition 5 (-irrelevance Abstraction ).
Given an MDP , any states , and an arbitrary but fixed weighting function , a -irrelevance abstraction is such that for any action , implies .
is a mapping from ground states to abstract states and defines the abstract MDP.³ Although the definition is for infinite-horizon problems (due to discounting), our finite-horizon problem readily converts to an infinite-horizon problem by adding an absorbing state that is reached after the overt stages. The weight is the probability of the underlying state given that its abstract state is observed. Any fixed policy, e.g., -greedy with a fixed exploration rate, induces a valid weighting and satisfies the definition.
Theorem 1 (Convergence of Q-learning under -irrelevance ).
Assume that each state-action pair is visited infinitely often and the step-size parameters decay appropriately. Q-learning with abstraction converges to the optimal state-action value function in the ground MDP. Therefore, the resulting optimal abstract policy is also optimal in the ground MDP.
Because Li et al. do not consider action abstractions, we redefine the ground MDP to have the same actions as the sense-move-effect MDP. Additionally, to keep the ground MDP Markov, we add the current level , and the current point of focus , to the state. This does not essentially change the tabular pegs on disks domain but merely allows us to rigorously make the following connection.
Let states and actions of the ground MDP be denoted by and respectively. Similarly, let states and actions of the sense-move-effect MDP be denoted by and respectively. Let be the observation function.
Theorem 2 ( is -irrelevant).
The sense-move-effect abstraction, , is a -irrelevance abstraction.
can be computed from and . The reward after the current overt stage depends on , whether or not it is possible to select a peg/disk, and whether or not it is possible to avoid selecting a placed peg. These are known from and . Furthermore, whether or not a peg will be held after the current stage can be determined from and . Finally, due to and the fact that all pegs are initially unplaced, the sum of future rewards following an optimal policy from the current stage depends only on (a) whether or not a peg will be held after the current stage and (b) the amount of time left, . ∎
V-A4 Simulation Results
The proof of Theorem 2 suggests that the observability of pegs, disks, placed pegs, and empty space are all important for learning the optimal policy. On the other hand, we empirically found no disadvantage to removing the (placed pegs) and (empty space) grids. However, it is important to distinguish unplaced pegs from placed pegs. Fig. 8 shows learning curves for an HSA agent with and grids versus an HSA agent with the same grids but showing pegs/disks regardless of whether or not they are placed/occupied.
Lookahead HSA and Deictic Image Mapping variants (Sections IV-C and IV-D) result in an even smaller state-action space than standard HSA. In the tabular domain, this means faster convergence (Fig. 9). Although the deictic representation seems superior in these results, it has a serious drawback: the action-selection time scales linearly with because there is one action for each cell in the underlying grid. The lookahead variant captures the best of both worlds – a small representation and fast execution. Thus, in the tabular domain, lookahead appears to be a satisfactory middle ground between the two approaches. However, for the more complex domains, where Q-function approximation is required, the constant time needed to generate the action images becomes more significant, and the advantage of lookahead in terms of episodes to train diminishes. The latter may be because the function approximator (a CNN) has built-in position equivariance, taking away some of the advantage of the lookahead representation.
In each of these experiments, the learning algorithm was Sarsa, and actions were taken greedily w.r.t. the current Q-estimate. Optimistic initialization of action-values and random tie-breaking were relied on for exploration. Under these conditions, Q-learning produces the same learning curves as Sarsa.
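The exploration scheme just described can be sketched with tabular Sarsa on a toy two-step chain task. The chain environment is an illustrative stand-in for the pegs-on-disks domain, and all names are ours; only the algorithmic ingredients (optimistic initial values, greedy selection, random tie-breaking) come from the text:

```python
import random
from collections import defaultdict

random.seed(0)

def greedy(Q, s, actions):
    """Greedy action with random tie-breaking."""
    best = max(Q[(s, a)] for a in actions)
    return random.choice([a for a in actions if Q[(s, a)] == best])

def sarsa(env_step, actions, episodes=200, alpha=0.5, gamma=1.0, q0=1.0):
    """Tabular Sarsa with optimistic initial values (q0 at or above the
    best attainable return), relying on optimism plus tie-breaking for
    exploration rather than epsilon-greedy randomness."""
    Q = defaultdict(lambda: q0)
    for _ in range(episodes):
        s, done = 0, False
        a = greedy(Q, s, actions)
        while not done:
            s2, r, done = env_step(s, a)
            a2 = greedy(Q, s2, actions)
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

def env_step(s, a):
    """Toy chain: from state 0, action 1 leads to state 1; the episode
    then ends, with reward 1 only for action 1 taken in state 1."""
    if s == 0:
        return (1 if a == 1 else 2), 0, False
    return s, (1 if (s == 1 and a == 1) else 0), True

Q = sarsa(env_step, actions=(0, 1))
```

Because every value starts at the optimistic q0, suboptimal actions are tried until their estimates decay below the rewarding path, after which greedy selection settles on the optimal policy.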
V-B Upright Pegs on Disks
In this domain, pegs and disks are modeled as tall and flat cylinders, respectively, where the cylinder axis is always vertical (Fig. 10, left). Unlike the tabular domain, object size and position are sampled from a continuous space. Grasp and place success are checked with a set of simple conditions appropriate for upright cylinders.⁴ Grasp conditions: gripper is collision-free and the top-center of exactly 1 cylinder is in the gripper’s closing region. Place conditions: entire cylinder is above an unoccupied disk and the cylinder bottom is at most 1 cm below or 2 cm above the disk surface. The reward is 1 for grasping an unplaced peg, -1 for grasping a placed peg, 1 for placing a peg on an unoccupied disk, and 0 otherwise.
Observations consist of 1 or 2 images (, , ); the current HSA level, ; and the e.e. status, . Each HSA level selects position (Fig. 10, right). Gripper orientation is not critical for this problem.
V-B1 Network Architecture and Algorithm
The Q-function consists of 6 convolutional neural networks (CNNs), 1 for each level and e.e. status, with identical architecture (Table II). This architecture results in faster execution time compared with our previous version. The loss is the squared difference between the predicted and the actual action-value target, averaged over a mini-batch. The action-value target is the reward received at the end of the current overt stage.⁵ With standard Monte Carlo (MC) and , the action-value target would be the sum of rewards received after the current time step . Since, for this problem, no positively rewarding grasp precludes a positively rewarding place, ignoring rewards after the current overt stage is acceptable. For CNN optimization, Adam is used with a base learning rate of , weight decay of , and mini-batch size of 64.
V-B2 Simulation Results
We tested the standard HSA approach with 1, 2, and 3 levels. The number of actions (CNN outputs) was set to achieve the same precision in positioning of the e.e.: 1 level had outputs, 2 levels outputs and outputs, and 3 levels each had outputs. Note that with 1 level this is comparable to the DQN (i.e., no-hierarchy) approach (Section IV-D). Exploration was -greedy with set such that the probability of taking 1 random action in an overt stage was the same: 0.04 for 3 levels, 0.059 for 2 levels, and 0.115 for 1 level.
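One way to derive per-level exploration rates with this property is sketched below. The derivation is our assumption about how the values were set, but it is numerically consistent with those reported (0.115, 0.059, 0.04):

```python
def per_level_epsilon(p_stage, levels):
    """Per-level epsilon so the probability of at least one random
    action per overt stage equals p_stage regardless of depth:
    1 - (1 - eps)**levels == p_stage."""
    return 1.0 - (1.0 - p_stage) ** (1.0 / levels)

eps1 = per_level_epsilon(0.115, 1)  # 0.115
eps2 = per_level_epsilon(0.115, 2)  # ~0.059
eps3 = per_level_epsilon(0.115, 3)  # ~0.040
```

Matching the per-stage exploration probability keeps the comparison across hierarchy depths fair, since deeper hierarchies take more decisions per overt stage.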
Results are shown in Fig. 11. The 1 and 3 level cases perform similarly in the end, with faster learning in terms of episodes for the 1 level case. However, execution is slower for 1 level due to the large number of actions that need to be processed. We were not able to determine why 2 levels performs worse than 1 or 3 levels.
We tested standard HSA and lookahead HSA with 3 levels. Lookahead HSA with 1 level would be the Deictic Image Mapping approach (Section IV-D), but computation of all images was prohibitively expensive. For this comparison, the number of actions was increased to per level (thus the improvement compared to Fig. 11). Results are shown in Fig. 12. As in the tabular domain, lookahead learns faster in terms of the number of episodes; however, the difference is less pronounced than before, and the execution time for standard HSA is less than half that of lookahead.
V-C Pegs on Disks
The pegs on disks domain is the same as the upright pegs on disks domain except the cylinder axis of pegs can be either horizontal or vertical, and a correct placement requires the peg axis to be vertical (Fig. 13).⁶ Grasp conditions: gripper collision-free, exactly 1 peg intersects gripper closing region, peg axis intersects gripper closing region, fingers contact sides or ends (but not 1 of each). Place conditions: peg is vertical, entirely over an unoccupied disk, bottom cm of disk surface, and gripper not in collision. This seemingly innocuous change has important consequences in terms of problem complexity. First, the orientation of the gripper is important for both grasping and placing. Second, it is possible to grasp the peg by its ends, thereby preventing the possibility of a successful place in the next stage. Thus, the robot needs to reason longer-term about the consequences of its actions.
We examine 2 reward functions: the first is the same as with upright pegs on disks, and the second is the same except for a reward of 0 for a stable grasp where the fingers are on the cylinder ends. The latter is called the “shaped grasp reward function” and requires less long-term reasoning because grasps which cannot result in successful placements are penalized by the reward received after the current overt stage.
Observations are similar to before except now the image resolution is higher (), and the current overt time step is included whenever the grasp reward is unshaped. HSA has 3 levels selecting position and 1 level selecting orientation about the gripper approach axis (Fig. 14).
V-C1 Network Architecture and Algorithm
The network architecture is described in Table III. Weight decay is changed to 0. When shaping the grasp reward, Q-estimate targets are the sum of rewards for the current stage. When not shaping the grasp reward, Q-estimate targets are computed via standard Monte Carlo (MC) except the discount is accumulated only after the end of an overt stage (instead of after every time step).
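Stage-wise discounting can be sketched as follows, assuming rewards are already summarized per overt stage; the function name and the reward-list input are illustrative:

```python
def stage_mc_targets(stage_rewards, gamma):
    """Monte Carlo action-value targets where the discount is applied
    once per overt stage rather than once per time step; every action
    (sense or overt) within stage t shares the target for stage t."""
    G, targets = 0.0, []
    for r in reversed(stage_rewards):
        G = r + gamma * G
        targets.append(G)
    return list(reversed(targets))

# Grasp reward at stage 0, place reward at stage 1, discount 0.5:
targets = stage_mc_targets([1.0, 1.0], 0.5)  # [1.5, 1.0]
```

Discounting per stage rather than per time step keeps the target for an overt stage independent of how many sense actions the hierarchy inserts before it.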
Table III: network architecture, listing each layer’s kernel size, stride, and output size.
V-C2 Simulation Results
The first question is whether the robot can learn to perform long-term reasoning as well as it can learn to perform short-term reasoning. Tests are run with and without shaping the grasp reward (Fig. 15). For the unshaped case, we tested 7 different discount factors using the same random seed and chose the best one, . With reward shaping, learning converges faster and ultimately a better policy is learned.
The second question is how long-term reasoning is affected by the choice of RL algorithm. Here we test Monte Carlo, Sarsa, and Q-learning (Fig. 16).⁷ Specifically, 4-step Sarsa and Q-learning were used, which performed better than 1-step versions in preliminary tests. The discount factor is increased to to emphasize the difference between the methods. Q-learning and Sarsa do indeed learn faster in terms of episodes, but the end improvement is marginal. Note that MC executes faster, since the Q-function is not consulted when generating action-value targets.
V-D Bottles on Coasters
The main question addressed here is whether HSA can be applied to a practical problem and implemented on a physical robotic system. The bottles on coasters domain is similar to the pegs on disks domain, but now objects have complex shapes and are required to be placed upright.⁸ Grasp conditions: gripper closing region intersects exactly 1 object and the antipodal condition from  with friction cone. Place conditions: bottle is upright, center of mass (CoM) position at least 2 cm inside an unoccupied coaster, and bottom within cm of coaster surface. The reward is for grasping an unplaced object more than 4 cm from the bottom (bottom grasps are kinematically infeasible in the physical system), for grasping a placed object, for placing a bottle, and 0 otherwise.
Observations are the same as in the pegs on disks domain except the overt time step is always input to grasp networks (and never input to place networks). The 4-level hierarchy is illustrated in Fig. 5.
V-D1 Network Architecture and Algorithm
The network architecture is the same as in Table III, except the number of kernels for conv-1 and conv-3 was increased to 64. Weight decay is again 0. Q-network targets are the reward after the current overt stage.
V-D2 Simulation Results
70 bottles from 3DNet  were randomly scaled to heights of 10-20 cm. Bottles were placed upright with probability and on their sides with probability . Learning curves for 2 bottles and 2 coasters are shown in Fig. 17. Performance is somewhat worse than in the pegs on disks domain, reflecting the additional problem complexity.
V-D3 Top- Sampling
Before considering experiments on a physical robotic system, we address an important assumption of the move-effect system of Section III. The assumption is that the e.e. can move to any pose in the robot’s workspace. Recent advances in motion planning algorithms make this a reasonable assumption for the most part; nonetheless, a pose can still sometimes be unreachable due to obstacles, motion planning failure, or IK failure.
To address this issue, multiple high-valued actions are sampled from the policy learned in simulation. In particular, for each level of an overt stage, we take the top- samples according to Eq. 3, whose terms are the action-value estimate at each level and the maximum and minimum possible action-values.
Preliminary tests in simulation showed sampling top- values performs better than sampling top- values, as was done previously .
At test time, the resulting samples are checked for an IK and motion plan solution in descending order of value. As the number of samples increases, the probability of failing to find a reachable e.e. pose decreases; however, the more poses that are unreachable, the lower the value. Thus, when designing an HSA system, it is important not to overconstrain the space of actions.
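The sample-then-check procedure above can be sketched as follows; the `is_reachable` callback, standing in for the IK and motion-planning checks, and all names are hypothetical:

```python
import heapq

def select_reachable_pose(scored_poses, k, is_reachable):
    """Take the k highest-valued candidate poses, then return the first
    one that passes the reachability checks, in descending order of
    value. scored_poses is an iterable of (value, pose) pairs."""
    top_k = heapq.nlargest(k, scored_poses, key=lambda vp: vp[0])
    for value, pose in top_k:
        if is_reachable(pose):  # e.g., IK solution + collision-free plan
            return pose
    return None  # all k candidates unreachable; caller must handle this
```

For example, with candidates `[(0.9, 'A'), (0.7, 'B'), (0.8, 'C')]` and pose `'A'` unreachable, the function falls back to the next-best reachable pose, `'C'`.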
V-D4 Robot Experiments
We tested the bottles on coasters task with the physical system depicted in Fig. 18. The system consists of a Universal Robots 5 (UR5) arm, a Robotiq 85 parallel-jaw gripper, and a Structure depth sensor. The test objects (Fig. 19) were not observed during training. The CNN weight files had approximately average performance among the 10 realizations (Fig. 17).
Initially, 2 coasters were randomly selected and placed in arbitrary positions in the back half of the robot’s workspace (positions too close to the robot resulted in unreachable places). Then, 2 bottles were randomly selected and placed upright with probability and on the side with probability . The bottles were not allowed to be placed over a coaster. Python’s pseudorandom number generator was used to decide the objects used and upright/side placement; object position was decided by a human instructed to make the scenes diverse. Top- sampling was used. A force threshold was set for the final grasp/place approach: if the magnitude of the force on the arm exceeded this threshold, the motion was canceled and the open/close action was immediately performed.
Results are summarized in Table IV, and a successful sequence is depicted in Fig. 20. A grasp was considered successful if a bottle was lifted to the “home” configuration; a place was considered successful if a bottle was placed upright on an unoccupied coaster and remained there after the gripper withdrew. The failures were: grasping a placed object, placing too close to the edge of a coaster so the bottle fell over, placing upside-down, the object slipping in the hand after a grasp and causing a place failure, and the object falling out of the hand after a grasp.
[Table IV (excerpt): Number of Objects — 1.97, 1.67.]
V-E 6-DoF Pick-Place
The HSA method was also implemented for 6-DoF manipulation, and the same system was tested on 3 different pick-place tasks. (This section refers to an earlier version of our system, so the simulations took longer and the success rates for bottles are lower.) The setup was similar to that in Fig. 18, except the sensor was mounted to the wrist; see  for more details. The tasks included stacking a block on top of another, placing a mug upright on the table, and (similar to Section V-D) placing a bottle on a coaster. All tasks included novel objects in light to moderate clutter (Fig. 21). To handle perceptual ambiguities in mugs, the observations were 3-channel images projected from a point cloud obtained from 2 camera poses. HSA included 6 levels: 3 for position and 1 for each Euler angle. Results from UR5 experiments are shown in Table V.
The primary conclusion is that the sense-move-effect abstraction, when coupled with hierarchical spatial attention, is an effective way of simultaneously handling (a) high-resolution 3D observations and (b) high-dimensional, continuous action spaces. These two issues are intrinsic to realistic problems of robot learning. We provide several other considerations relevant to systems employing spatial attention:
VI-A Secondary Conclusions
-  Compared to a flat representation, HSA can result in an exponential reduction in the number of actions that need to be sampled (Section IV-B).
-  HSA generalizes DQN, and lookahead HSA generalizes Deictic Image Mapping (Section IV-D).
-  The partial observability induced by an HSA observation does not preclude learning an optimal policy (Section V-A).
-  HSA may take longer to learn than DQN in terms of the number of episodes to convergence, but HSA executes faster when the number of actions is large (Section V-B).
-  Reward shaping can improve learning performance when there are long-term consequences of actions (Section V-C).
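The exponential reduction in sampled actions can be made concrete with a small count, under the assumption that each attention level selects among a fixed k x k x k grid covering the current focus region (names are ours):

```python
def flat_actions(n):
    """Flat representation: one action candidate per cell of an
    n x n x n position grid, i.e., linear in workspace volume."""
    return n ** 3

def hsa_actions(n, k):
    """HSA: each level picks one of k^3 cells, shrinking the focus
    region by a factor of k per dimension, so reaching cell resolution
    takes about log_k(n) levels of k^3 candidates each."""
    levels, size = 0, 1
    while size < n:
        size *= k
        levels += 1
    return levels * k ** 3
```

For a 64 x 64 x 64 grid with k = 4, the flat scheme evaluates 262,144 candidates, while the hierarchical scheme evaluates 3 levels of 64 candidates, or 192: logarithmic rather than linear in the workspace volume.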
VI-B Limitations and Future Work
With our approach, the system designer still needs to select HSA parameters and possibly a reward function. In principle, these could also be learned by the robot, but this would likely require more training. Another concern is that, even for some of the simpler problems, an optimal policy was not learned in simulation. To some extent this may be remedied by top- sampling. (Notice the real robot results in Section V-D are better than the simulation results.) However, we would like to understand and address this problem more rigorously. Finally, how manipulation skills can be automatically and efficiently transferred to different but related tasks remains an interesting, open question. Even small changes to the task, such as including distractor objects, require complete retraining of the system for maximum performance.
We thank Andreas ten Pas for reviewing an early draft of this paper and the anonymous reviewers of  for their insightful comments.
-  S. Whitehead and D. Ballard, “Learning to perceive and act by trial and error,” Machine Learning, vol. 7, no. 1, pp. 45–83, 1991.
-  M. Gualtieri and R. Platt, “Learning 6-DoF grasping and pick-place using attention focus,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87, Oct 2018, pp. 477–486.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  R. Platt, C. Kohler, and M. Gualtieri, “Deictic image maps: An abstraction for learning pose invariant manipulation policies,” in AAAI Conf. on Artificial Intelligence, 2019.
-  T. Lozano-Pérez, “Motion planning and the design of orienting devices for vibratory part feeders,” IEEE Journal Of Robotics And Automation, 1986.
-  J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield, “Deep object pose estimation for semantic robotic grasping of household objects,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87. PMLR, Oct 2018, pp. 306–316.
-  I. Lenz, H. Lee, and A. Saxena, “Deep learning for detecting robotic grasps,” The Int’l Journal of Robotics Research, vol. 34, no. 4-5, pp. 705–724, 2015.
-  L. Pinto and A. Gupta, “Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours,” in IEEE Int’l Conf. on Robotics and Automation, 2016, pp. 3406–3413.
-  S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with large-scale data collection,” in Int’l Symp. on Experimental Robotics. Springer, 2016, pp. 173–184.
-  J. Mahler, J. Liang, S. Niyaz, M. Laskey, R. Doan, X. Liu, J. A. Ojea, and K. Goldberg, “Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics,” Robotics: Science and Systems, vol. 13, 2017.
-  A. ten Pas, M. Gualtieri, K. Saenko, and R. Platt, “Grasp pose detection in point clouds,” The Int’l Journal of Robotics Research, vol. 36, no. 13-14, pp. 1455–1473, 2017.
-  D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, “Scalable deep reinforcement learning for vision-based robotic manipulation,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87. PMLR, 29–31 Oct 2018, pp. 651–673.
-  D. Morrison, P. Corke, and J. Leitner, “Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach,” 2018.
-  D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine, “Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods,” in IEEE Int’l Conf. on Robotics and Automation, 2018.
-  A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo et al., “Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,” in IEEE Int’l Conf. on Robotics and Automation, 2018.
-  Y. Jiang, C. Zheng, M. Lim, and A. Saxena, “Learning to place new objects,” in Int’l Conf. on Robotics and Automation, 2012, pp. 3088–3095.
-  M. Gualtieri, A. ten Pas, and R. Platt, “Pick and place without geometric object models,” in IEEE Int’l Conf. on Robotics and Automation, 2018.
-  A. Xie, A. Singh, S. Levine, and C. Finn, “Few-shot goal inference for visuomotor learning and planning,” in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, vol. 87. PMLR, 29–31 Oct 2018, pp. 40–52.
-  S. James, A. Davison, and E. Johns, “Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task,” in Conf. on Robot Learning, vol. 78. Proceedings of Machine Learning Research, 2017, pp. 334–343.
-  M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” in Advances in Neural Information Processing Systems, 2017, pp. 5048–5058.
-  J. Kober, A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
-  C. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, University of Cambridge, 1989.
-  G. Rummery and M. Niranjan, “On-line Q-learning using connectionist systems,” Cambridge University Engineering Department, CUED/F-INFENG/TR 166, September 1994.
-  T. Lillicrap, J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
-  C. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
-  T. Jaakkola, M. Jordan, and S. Singh, “Convergence of stochastic iterative dynamic programming algorithms,” in Advances in neural information processing systems, 1994, pp. 703–710.
-  L. Li, T. Walsh, and M. Littman, “Towards a unified theory of state abstraction for MDPs,” in Int’l Symp. on Artificial Intelligence and Mathematics, 2006.
-  N. Sprague and D. Ballard, “Eye movements for reward maximization,” in Advances in neural information processing systems, 2004, pp. 1467–1474.
-  H. Larochelle and G. Hinton, “Learning to combine foveal glimpses with a third-order Boltzmann machine,” in Advances in neural information processing systems, 2010, pp. 1243–1251.
-  V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent models of visual attention,” in Advances in neural information processing systems, 2014, pp. 2204–2212.
-  M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
-  M. Gualtieri and R. Platt, “Viewpoint selection for grasp detection,” in IEEE/RSJ Int’l Conf. on Intelligent Robots and Systems, 2017, pp. 258–264.
-  D. Morrison, P. Corke, and J. Leitner, “Multi-view picking: Next-best-view reaching for improved grasping in clutter,” in IEEE Int’l Conf. on Robotics and Automation, 2019.
-  R. Sutton and A. Barto, Reinforcement Learning: An Introduction, 2nd ed. MIT Press Cambridge, 2018.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Int’l Conf. on Learning Representations, 2014.
-  W. Wohlkinger, A. Aldoma, R. Rusu, and M. Vincze, “3DNet: Large-scale object class recognition from CAD models,” in IEEE Int’l Conf. on Robotics and Automation, 2012, pp. 5384–5391.