Log In Sign Up

Spatial Action Maps for Mobile Manipulation

by   Jimmy Wu, et al.

This paper proposes a new action representation for learning to perform complex mobile manipulation tasks. In a typical deep Q-learning setup, a convolutional neural network (ConvNet) is trained to map from an image representing the current state (e.g., a birds-eye view of a SLAM reconstruction of the scene) to predicted Q-values for a small set of steering command actions (step forward, turn right, turn left, etc.). Instead, we propose an action representation in the same domain as the state: "spatial action maps." In our proposal, the set of possible actions is represented by pixels of an image, where each pixel represents a trajectory to the corresponding scene location along a shortest path through obstacles of the partially reconstructed scene. A significant advantage of this approach is that the spatial position of each state-action value prediction represents a local milestone (local end-point) for the agent's policy, which may be easily recognizable in local visual patterns of the state image. A second advantage is that atomic actions can perform long-range plans (follow the shortest path to a point on the other side of the scene), and thus it is simpler to learn complex behaviors with a deep Q-network. A third advantage is that we can use a fully convolutional network (FCN) with skip connections to learn the mapping from state images to pixel-aligned action images efficiently. During experiments with a robot that learns to push objects to a goal location, we find that policies learned with this proposed action representation achieve significantly better performance than traditional alternatives.


page 1

page 2

page 3

page 4

page 5

page 8


Finding Shortest Path for Developed Cognitive Map Using Medial Axis

this paper presents an enhancement of the medial axis algorithm to be us...

Implicit Kinematic Policies: Unifying Joint and Cartesian Action Spaces in End-to-End Robot Learning

Action representation is an important yet often overlooked aspect in end...

Actionness Estimation Using Hybrid Fully Convolutional Networks

Actionness was introduced to quantify the likelihood of containing a gen...

Spatial Intention Maps for Multi-Agent Mobile Manipulation

The ability to communicate intention enables decentralized multi-agent r...

UMPNet: Universal Manipulation Policy Network for Articulated Objects

We introduce the Universal Manipulation Policy Network (UMPNet) – a sing...

Model-based Behavioral Cloning with Future Image Similarity Learning

We present a visual imitation learning framework that enables learning o...

Equivariant Q Learning in Spatial Action Spaces

Recently, a variety of new equivariant neural network model architecture...

I Introduction

The prolific success of deep learning in vision 

[13, 14, 16, 33] has inspired much recent work on using deep convolutional networks (ConvNets) for visual navigation and mobile manipulation in robotics. In such an approach, ConvNets are trained to model a policy that maps from an agent’s observation of the state (e.g.

, images) to the probability distribution over actions (

e.g., steering commands) maximizing expected task success.

Most work in this domain has focused on methods to train the ConvNets: from learning to imitate human control strategies [4, 11, 22, 24]

, to discovering more complex long-term path planning behaviors with reinforcement learning 

[7, 20, 29, 47]. Other work has considered the impact of different state representations: from front-facing camera images [4, 22, 24, 26] to top-down bird’s-eye views (BEV) of the scene generated with Inverse Perspective Mapping (IPM) [5, 7, 11, 29]. However, there has been very little work on the impact of different action representations. Almost all navigation systems based on ConvNets consider only a small set of (possibly parameterized) egocentric steering commands (e.g., move forward/backward, strafe right/left, rotate left/right, etc.) [7, 11, 12, 20, 24, 26, 29, 47].

Fig. 1: In this work, we propose “spatial action maps” as a representation for mobile manipulation tasks. In our setting, the goal is to train an agent to navigate around an unseen environment, find objects, and push them into a designated target receptacle. It does so by iteratively predicting the Q-value of navigating to each location in a spatial action map (red is better), and then selecting the best location to go to (marked with ‘x’).

Representing actions with steering commands presents several problems for learning complex mobile manipulation tasks (Fig. 2a). First, they are myopic, and thus each action can reach only a small subset of possible end-points (i.e., ones at a fixed distance and reachable by unobstructed straight-line paths). Second, they each invoke only a small change to the state, and thus a long sequence of actions is required to make significant change (e.g., navigate through a series of obstacles). Third, they require a deep Q-network to learn a complex mapping from a high-dimensional input state representation (usually an image) to a low-dimensional set of (possibly parameterized) action classes, which may require many training examples.

In this paper, we advocate for a new action representation, “spatial action maps” (Fig. 1). The main idea is to represent actions as a dense map of navigation end-points: each action represents a move to an end-point, possibly along a non-linear trajectory, and possibly with a task to perform there (Fig. 2b-c). The advantages of this action representation are three-fold. First, the agent is not myopic – it can move to end-points in a single action [1]. Second, each single action can represent an arbitrarily complex navigation trajectory – ours follow shortest paths to end-points while avoiding obstacles (Fig. 2c). Third, it simplifies the mapping from states to actions – in our system, each state is represented by an image of the reconstructed scene from a bird’s-eye view (in IPM coordinates), and the action space is represented by an image representing the expected Q-value of navigating to every end-point (also in IPM coordinates). Since the state and action space lie in the same domain and are pixel-aligned, we can train a fully convolutional network (FCN) to map between them – an efficient way to predict values for all possible actions in one forward pass of a network.

We study this action representation in the context of reinforcement learning for mobile manipulation, where an agent is tasked with exploring an unseen environment containing obstacles and objects, with the goal of navigating to push all objects into a designated target zone – i.e., pushing the objects into a receptacle like a bulldozer (Fig. 1

). At every step, the only information available to the agent is a bird’s-eye view image representing a partial reconstruction of its local environment (everything it has observed with a front-facing RGB-D camera, transformed into an IPM representation and accumulated over time to simulate online SLAM/reconstruction). The agent feeds the state image into an FCN to produce an action image encoding the Q-value of moving along the estimated shortest path trajectory to every end-point location. It executes the action with highest Q-value and iterates.

The spatial action map representation has the key benefit that the spatial position of each state-action value prediction (with respect to the input IPM view) represents a local milestone (trajectory end-point) for the agent’s control strategy. We conjecture that it is easier for convolutional networks to learn the state-action values of these navigational end-points (as opposed to abstract low-level steering commands), since each prediction is aligned and translationally anchored to the visual features of the objects and obstacles directly observed in the input map. This is motivated by gains in performance observed in other domains – for example, image segmentation [14] has been shown to benefit from pixel-aligned input and output representations, while end-to-end robotic manipulation [21, 45, 44, 43] is significantly more sample efficient when using FCNs to predict state-action values for a dense sampling of grasps aligned with the visual input. Using spatial action maps, our experiments show that we are capable of training end-to-end mobile manipulation policies that generalize to new environments with fewer than 60k training samples (state-action pairs). This is orders of magnitude less data than prior work [20, 47].

Our main contribution in this paper is the spatial action map representation for mobile manipulation. We not only propose the representation, but also investigate its use with a variety of action primitives (take a short step, follow the straight-line path, follow the shortest path) and state input channels (partial scene reconstructions, shortest path distances from the agent, shortest path distances to the receptacle). Via ablation studies, we find empirically that our proposed state and action representations provide better performance and learning efficiency compared to alternative representations in simulation environments, and we show that our policies also work in real environments. We provide supplemental materials, including code, simulation environments, and videos of our robot in action, at

Fig. 2: Action representations. (a) shows all trajectories (green) available in a discrete set of steering commands. The other images show example trajectories available in spatial action maps (where every pixel represents an available trajectory endpoint). We consider two variants: (b) where the agent follows a straight-line path to the selected target endpoint (denoted by ’x’), and (c) where it follows an estimated shortest path (our method).

Ii Related Work

Fig. 3: Overview. We train in a simulation environment, where we mimic online SLAM/reconstruction as the agent moves around. This is implemented via a virtual forward-facing RGB-D camera mounted on the robot. The robot incrementally builds up a global overhead map of the room, as well as an occupancy map used to compute single-source shortest path distances. Locally oriented and aligned crops of these maps are used to train our fully convolutional Q-network, which produces a dense, pixel-aligned spatial action map (Q-value map).

Navigation. There has been significant recent work on training agents to navigate virtual environments using first-person visuomotor control [2, 15, 19, 27, 28, 37, 38, 39]. In a typical setup, the agent iteratively gets first-person observations as input (e.g., images with color, depth, normals, semantic segmentations, etc.), from which it builds a persistent state (e.g., a map), and selects one of many possible actions as output (e.g., move, rotate, strafe, etc.) until it completes a task (e.g., navigate to a given location, find a particular type of object, etc.). These works almost exclusively focus on how to train neural networks for the navigation tasks, for example using deep reinforcement learning [19, 47, 6, 25]

, supervised learning 

[34, 12], and predicting the future [9]. They do not study how different parameterizations of network outputs (actions) affect the learning performance – i.e., the inputs are always in one domain (e.g., images, GPS coordinates, etc.) and the outputs are in another (e.g., move forward, rotate right, interact, etc.). In contrast, we investigate the advantages of dense predictions using spatial action maps, where the inputs and outputs are represented in the same spatial domain.

Mobile Manipulation. While the majority of the navigation algorithms assume a static environment, other works also consider interaction with movable objects in the environment [18, 32, 35]. In these works, the agent navigates to the goal location by pushing aside movable obstacles that are in the way. For example,  Stilman and Kuffner [31] propose the task of Navigation Among Movable Obstacles, where the agent navigates in an environment that contains both static structures (e.g., walls and columns) and movable obstacles. While these methods do not assume a static environment, the task is similar – the agent only needs to navigate to the goal location. In contrast, in our task, the agent must come up with a navigational plan that will push all objects in the environment to the goal location, which is a much more complex problem.

Another line of work studies navigation for object manipulation [42, 17, 10, 23]. These tasks (e.g., picking and rearrangement of objects) are similar to ours, but the robot-object interactions considered in these tasks (e.g.

, grasping) are often more predictable and happen over short time horizons, which makes it possible to apply simple heuristic-based algorithms to separately handle navigation and interaction. In contrast, our setup requires long time-horizon robot-object interactions, which are less predictable and more difficult to plan.

Dense Action Representations. Our work is inspired by recent work on dense affordance prediction for bin picking. For example, the multi-affordance picking strategy of Zeng et al. [45] selects grasps by predicting a score for every pixel of an overhead view of a bin. Similar approaches are used to predict affordances for grasping or suction in [44, 46, 41, 30]. However, these systems are trained in the more constrained setting of bin picking, where supervision is available for grasp success, and where motion trajectories to achieve selected grasps can be assumed to be viable and of equal cost. In our work, we apply dense prediction to a more challenging scenario, where different actions have different costs (and some may not even be viable), and where long-term planning is required to perform complex sequences of actions.

Iii Methods

In this paper, we propose a new dense action representation in which each pixel of a bird’s-eye view image corresponds to the atomic action of navigating along the shortest path to the associated location in the scene.

To investigate this action representation, we consider a navigate-and-push setting in which an agent must explore an unseen environment, locate objects, and push them into a designated target “receptacle” (Fig. 3). This task may be seen as an abstraction of real-world tasks such as bulldozing debris or sweeping trash, and is sufficiently complex that it would be difficult to implement an effective hand-coded policy or learn a policy with traditional action representations.

Our agent is an Anki Vector, a low-cost mobile tracked robot approximately 10 cm in length, augmented with a 3D-printed bulldozer-like end-effector (Fig.

1). The objects to be pushed are 4.4 cm cubic blocks, and the receptacle is located in the top right corner of a 1 m by 0.5 m “room” with random partitions and obstacles. For ease of prototyping, we include fiducial markers on the robot and the objects, and track them with an overhead camera. Our setting though, is intended to represent what would be possible with on-board sensing and SLAM. Therefore, the only information made available to the agent is a simulation of what would be observed via an onboard forward-facing RGB-D camera with a 90 field of view, integrated over time with online mapping – i.e., our agents do not have access to ground truth global state.

Our agents are trained in a PyBullet simulation [8], where state observations are generated by synthetically rendering camera views of the environment. We then execute learned policies in the real world, where the fiducial markers are used to update the state of the simulator (e.g., robot and object poses). Simulation enables us to train our agents over a wider range of environments than would be possible with physical robots, while the sim-to-real mirroring enables us to evaluate the robustness and generalization of our policies to unseen dynamics in the real world (e.g., mass and friction).

In the following subsections, we provide details about our training setup and describe how the appropriate action representation proves crucial to reducing the training time and increasing the generalizability of our policies.

Iii-a Reinforcement Learning (DQN) Formulation

We formulate the navigate-and-push problem as a Markov decision process: given state

at time , the agent chooses to execute an action according to a policy , then transitions to a new state and receives a reward . The goal of reinforcement learning is to find an optimal policy that selects actions maximizing total expected rewards , i.e., a -discounted sum over an infinite horizon of future returns from time to .

In this work, we investigate off-policy Q-learning [20] to train a greedy policy that chooses actions by maximizing a parameterized Q-function (i.e., state-action value function), where denotes the weights of our neural network (whose architecture we describe in Sec. III-D). We train our agents using the double DQN learning objective [36]. Formally, at each training iteration , our objective is to minimize:

where is a transition uniformly sampled from the replay buffer. Target network parameters are held fixed between individual updates. More training details are presented in Sec. III-D.

Iii-B State Representation

Fig. 4: Input channels. From left to right, the pixel-aligned channels are all in the robot’s local coordinate frame and correspond to (1) the observation, which is an overhead image of the reconstructed environment, (2) encoded robot position in local coordinates, which is always in the center of the image, (3) shortest path distance from the agent to each position in the image, and (4) shortest path distance to the receptacle from each position in the image. The receptacle is in the corner of the room, near the top left of these images. Color added for visualization only, green/yellow corresponds to higher values.

Within our formulation, we represent the agent’s observation of the state as a visual 4-channel image from a local bird’s-eye view that is spatially aligned and oriented with respect to the robot’s coordinate frame (such that the robot is positioned at the center of each image and looking along the axis). This is similar to inverse perspective mapping (IPM), commonly used in autonomous driving [5, 7, 11, 29]. Each channel encodes useful information related to the environment [3] (visualized in Fig. 4) including: (1) a local observation of the robot’s surroundings in the form of an overhead map, (2) a binary mask with a circle whose diameter and position encodes the robot’s respective size and location in image coordinates, (3) an image where each pixel holds the shortest path distance from the agent to the corresponding location, and (4) an image where each pixel holds the shortest path distance to the receptacle from that location. The shortest path distances in the third and fourth channels are computed using an occupancy map generated only from local observations of obstacles (which are accumulated over time – see next paragraph) and normalized so that they contain relative values rather than absolute. All unobserved regions are treated as free space when computing shortest path distances. This reflects a realistic setting in which a robot has access to nothing but its own visual observations, GPS coordinates, local mapping, and task-related goal coordinates.

Online mapping. Online SLAM/reconstruction is a common component of any real mobile robot system. In our experiments, this is implemented in the simulation using images from a forward-facing RGB-D camera mounted on the robot. The camera captures a new image at the end of every transition. Using camera intrinsics and robot poses, depth data is projected into a 3D point cloud, then fused with previous views to generate and update a global map of the environment. At the beginning of each episode in a new environment, this global map is initially blank. As the robot moves around in the environment, the global map fills in as it accumulates more partial views over time. This restriction encourages the agent to learn a policy that can explore unseen areas in order to effectively complete the task. Specifically, the robot gradually reconstructs two global maps as it moves around: (1) a global overhead state map, which encodes objects and obstacles, and (2) a global occupancy map, used for shortest path computations in our high-level motion primitives as well as our state representations and partial rewards (see Sec. III-D). The agent makes no prior assumptions about the layout of obstacles in the environment.

Fig. 5: Action space. Our spatial action map includes shortest paths (c) from the current robot location to all positions in the environment (a), as estimated using an occupancy map from with online mapping (b).
(a) SmallEmpty (b) SmallColumns (c) LargeColumns (d) LargeDivider
Fig. 6: Environments. We tested in four types of environments, each with a receptacle in the top right corner (red), randomized arrangements of objects (yellow blocks), and obstacles (gray) of increasing difficulty (left to right).

Iii-C Action Representation

Our actions are represented by an image (i.e., action map, illustrated in Fig. 2) identically sized and spatially aligned to the input state representation. Each pixel in this action map corresponds to a navigational endpoint in local robot Cartesian coordinates, aligned to the scale of the visual observations in the state representation. At each time step, the agent selects a pixel location in the action map – and therefore in the observed environment – of where it intends to move to. Specifically, the selected location in the image indicates where the robot’s front-facing end effector should be located at after the action has been completed. The agent then uses a movement primitive to execute the move.

We experiment with two types of movement primitives: one that moves in a straight line towards the selected location, and one that follows the shortest path to the selected location. The straight line primitive simply turns the robot to face the selected location, and then moves forward until it has reached the location. The shortest path primitive uses the global occupancy map introduced in Sec. III-B to compute and follow the shortest path to the desired target location (Fig. 5). For both, it is possible for the robot to collide with previously unseen obstacles, in which case a reward penalty is incurred to the agent. Our experiments in Sec. IV-A compare this representation to discrete action alternatives (e.g., steering commands) commonly used in the literature. Note that in empty environments without obstacles, these two movement primitives are equivalent.

Iii-D Network Architecture and Training Details

We model our parameterized Q-function with a fully convolutional neural network (FCN), using a ResNet-18 [13] backbone for all experiments. The FCN takes as input the 4-channel image state representation (described in Sec. III-B) and outputs a state-action value map (described in Sec. III-C). We removed the AvgPool and fully connected layers at the end of the ResNet-18, and replaced them with 3 convolutional layers interleaved with bilinear upsampling layers. The added convolutional layers use 1x1 kernels, and the upsampling layers use a scale factor of 2. This gives us an output that is the same size as the input. We also removed all BatchNorm layers from our networks, which provided more training stability with small batch sizes. To ensure that the FCN has an adequate receptive field, we designed our observation crop size (96 x 96) such that the receptive field of each network output covers over a quarter of the input image area, and thus always covers the center of the image in which the robot is located.

Rewards. Our reward structure for reinforcement learning (computed after each transition) consists of three components: (1) a positive reward of +1 for each new object that moves into the designated receptacle (objects in the receptacle are removed thereafter), (2) partial rewards and penalties based on whether each object was moved closer to or further away from the receptacle (partial rewards are proportional to the change in distance computed using either Euclidean distances or shortest path distances – comparison in Sec. IV-A), and (3) small penalties for undesirable behaviors such as collision with obstacles, or for standing still.

Training details. During training, our agent interacts with the environment and stores data from each transition in an experience replay buffer of size 10,000. At each time step, we uniformly at random sample a batch of experiences from the experience replay buffer, and train our neural network with smooth L1 loss (i.e.

, Huber loss). We pass gradients only through the single pixel in the input state that corresponds to the selected action for a transition. We clip gradients such that they have magnitude at most 10. We train with batch size 32 and use stochastic gradient descent (SGD) with learning rate 0.01, momentum 0.9, and weight decay 0.0001. To account for varying distances traveled for different steps, we apply a discount factor

with an exponent that is proportional to the distance traveled during that step.

In our experiments, we train for 60,000 transitions, which typically corresponds to several hundred episodes. Each episode runs until either all objects in the environment have been pushed into the target receptacle, or the robot has not pushed an object into the receptacle for 100 steps. Our policies are trained from scratch, without any image-based pre-training [40]. The target network is updated every 1,000 steps. Training takes about 9 hours on a single NVIDIA Titan Xp GPU.

Exploration. Before training a network, we run a random policy for 1,000 steps to fill the replay buffer with some initial experience. Our exploration strategy is -greedy with initial , annealed over training to 0.01 after 6,000 transitions.

Iv Experiments

To test the proposed ideas, we run a series of experiments in both simulation and real-world environments. We first describe the simulation experiments, which are used to investigate trade-offs of different algorithms, and then we test our best algorithm on the physical robot.

Task. In every experiment, the robot is initialized with a random pose within a 3D environment enclosed by walls. Within the environment, there is a set of cubic objects scattered randomly throughout free space and a 15 cm by 15 cm square receptacle in the top right corner, which serves as the target destination for the objects. The robot’s task is to execute a sequence of actions that causes all of the objects to be pushed into the receptacle. Objects are removed from the environment after they are pushed into the receptacle.

Environments. We ran experiments with four virtual environments of increasing difficulty (Fig. 6). The first (SmallEmpty) is a small rectangular environment (1 m by 0.5 m) containing 10 randomly placed objects. The second (SmallColumns) is the same, but also contains a random number (1 to 3) of square (10 cm by 10 cm) fixed obstacles (like “columns”) placed randomly. The third (LargeColumns) is a similar but larger (1 m by 1 m) environment, with more columns (1 to 8), and more objects (20). The fourth (LargeDivider) is the same as the third, but replaces the columns with a single large divider that spans 80% of the dimension at a randomly chosen coordinate – this last environment requires the robot to plan how to get from one side of the divider to the other by going through the narrow open gap, and thus is the most difficult.

Fig. 7:

Plot of the number of objects pushed into the receptacle as a function of robot travel distance. These curves show evaluation of experiments that were trained with the LargeDivider environment (shown are means with shaded bars for standard deviations). Our method (orange) is slightly better than its ablation variants with restricted actions, and significantly better than the baseline method trained using steering commands (purple). The blue curve in the top left indicates the minimum distance an oracle robot could possibly travel if it had the ability to carry every object to the trash – it provides a reference for the scale of the problem.

Environment Ours No shortest path Fixed Steering
movement step size commands
SmEmpty 19.91 0.11 n/a 19.75 0.20 1.38 0.20
SmColumns 19.18 0.14 17.88 0.70 19.05 0.38 0.82 0.33
LgColumns 18.29 0.45 14.70 1.52 17.52 0.82 1.20 0.64
LgDivider 18.23 0.92 15.66 1.44 15.56 2.01 4.14 2.21

(Number of objects pushed into the receptacle per episode)

TABLE I: Effect of Action Representation

Evaluation metrics. We evaluate each model by running the trained agent for 20 episodes in the environment it was trained on. Since the environments are randomly generated every episode, for each model, we set the random seed so that the exact same set of generated environments are presented to each model, including the initial robot pose, object poses, and obstacle placements. For all experiments, we do 5 training runs with the same setup and report the mean and standard deviation across the 5 runs.

We use two evaluation metrics. The first simply measures the number of objects that have been pushed into the receptacle at the end of an episode (Tab. I). The second plots the number of objects successfully pushed into the receptacle as a function of how far the robot has moved (Fig. 7). Additionally, we compare sample efficiency of the training by plotting objects per episode on the training environments, as a function of training steps (Fig. 8). Higher numbers are better.

Fig. 8: We show here training curves for the LargeDivider environment. Our agent trained with spatial action map (orange curve) is significantly more sample efficient compared to the baseline that uses steering commands (purple).

Iv-a Simulation Experiments

Comparison to baseline. Our first set of experiments is designed to evaluate how spatial action maps compare to more traditional action formulations. To investigate this question, we ran experiments with an 18-dimensional discrete action space consisting of steering commands discretized into 18 turn angles. We created a modified (baseline) version of our system by replacing the last layer of our network with a fully connected layer that outputs the predicted Q-value for rotating by each of the 18 possible angles and then taking a step forward by 25 cm. We also added two additional channels to the state representation to encode the relative position of every pixel location. This modified network mimics the DQN architectures and actions typical of other navigation algorithms [2, 15, 19, 27, 28, 37, 38, 39], and yet is the same as ours in all other aspects.

Results are shown in Tab. I. We find that the steering commands baseline algorithm (right) is unable to learn effectively in any of the four environments. We conjecture the reasons are two-fold. First, the baseline network must learn a mapping from observations to action classes, which may be harder than the dense prediction enabled by spatial action maps. Second, the baseline agent can only reap rewards by executing a long sequence of short steps, and so it is difficult for the algorithm to achieve any reward at all in the early phases of training (Fig. 8). In contrast, our method utilizes more complex actions (that go all the way to the selected location along the shortest path), and thus can discover rewards with fewer actions. This results in a policy that performs the task more completely with less motion (Fig. 7).

Effect of actions based on shortest path trajectories. To test the hypothesis that shortest path trajectory actions help our algorithm learn more efficiently, we ran a second experiment using our full system with a small modification: the action primitive requires the robot to move in a straight line to the selected target location (rather than along the shortest path). The result of this experiment (“No shortest path movement” in Tab. I, and green curve in Fig. 7) show a degradation in performance compared to ours. Even though shortest paths are computed based on partial maps reconstructed from past observations (and thus may be wrong), using them to as our movement primitive seems to help the rate of learning.

Effect of actions with fixed step size. To test the hypothesis that action primitives with longer trajectories help our algorithm learn more efficiently, we ran a third experiment using our system with a different small modification: the length of any trajectory is limited to a small step (25 cm). That is, the same dense network predicts full Q-values for a dense space of actions as usual, but the agent can only step a fixed distance in the direction of the position with the highest Q-value at each iteration. The result of this variant (“Fixed step size” in Tab. I and red curve in Fig. 7) shows that taking shorter steps indeed degrades performance. Even though it would be possible for the agent to take many small steps to achieve the same trajectory as a single long one, we find that it learns more quickly with longer trajectories. We conjecture that this could be due to inconsistencies in Q-values predicted from different perspectives, as well as a less direct mapping from visual features to dense actions, causing the agent to waver between different endpoint targets as it makes many small steps.

Effect of shortest path input channels. In addition to an overhead image, our system provides the agent with three additional input image channels: (1) an image with a dot at the robot position, (2) an image with shortest path distances from the agent’s position, and (3) an image with shortest path distances to the receptacle. To test whether the latter two of these channels are helpful, we ran ablation studies without each of them. The results (Tab. IV) show that both channels provide little benefit in the small and empty environments, but help the system train more effectively in the most challenging environment (LargeDividers). Giving the agent input channels with estimated shortest paths seems to help it to choose target locations with objects that can be reached more easily (without going around large obstacles).

Effect of shortest path partial rewards. We train our agents with partial rewards given for pushing objects closer to or further away from the receptacle. These partial rewards are proportional to each object’s change in distance from the receptacle during a step. We hypothesize that in environments with obstacles, it is important to give partial rewards based on changes in shortest path distance rather than Euclidean distance from the receptacle. We verify by running ablations that use Euclidean distance rather than shortest path distance, shown in Tab. IV. We observe that agents trained with Euclidean distance partial rewards indeed perform worse, particularly for the LargeDivider environment, where true shortest path distances to the receptacle can be much larger than Euclidean distances. We believe that giving partial rewards based on shortest path distances provides a better training signal to the agent. While Euclidean distances indicate whether an object is close to the receptacle, shortest path distances additionally factor in the need to push the object around the obstacles to get to the target receptacle.

(a) Ours (b) Ours w/o shortest path components (c) Steering commands
Fig. 9: Episode trajectories. In this example, our method (a) follows a trajectory (blue line) that clears all objects efficiently. If we disable all shortest path components (b), it pushes objects straight towards the receptacle and they end up piled against the divider. If we use the baseline action representation based on steering commands (c), the agent never learns to navigate effectively in this difficult environment.
Environment Ours No shortest path No shortest path
from agent to receptacle
SmallEmpty 19.91 0.11 19.82 0.21 n/a
SmallColumns 19.18 0.14 19.18 0.32 19.20 0.22
LargeColumns 18.29 0.45 18.40 0.88 18.88 0.49
LargeDivider 18.23 0.92 16.87 1.97 16.71 1.49
TABLE III: Effect of Removing All Shortest Path Components
Environment Ours Ours Steering Steering
no shortest path no shortest path
SmEmpty 19.91 0.11 19.82 0.10 1.38 0.20 1.23 0.69
SmColumns 19.18 0.14 18.05 0.29 0.82 0.33 0.72 0.44
LgColumns 18.29 0.45 15.63 1.17 1.20 0.64 0.33 0.20
LgDivider 18.23 0.92 10.06 0.89 4.14 2.21 0.26 0.12
TABLE IV: Effect of Shortest Path Partial Rewards
Environment Ours No shortest path
in partial rewards
SmallColumns 19.18 0.14 19.13 0.28
LargeColumns 18.29 0.45 17.89 0.97
LargeDivider 18.23 0.92 16.87 1.97
TABLE II: Effect of Shortest Path Input Channels

Effect of removing all shortest path computations. Here we remove all shortest path components of our system, specifically (1) shortest path movement primitive, (2) shortest path channels, and (3) shortest path partial rewards. We replace them with their straight-line variants: (1) straight line movement, (2) channel containing Euclidean distance to receptacle, and (3) Euclidean distance partial rewards. The results are shown in Tab. IV. Indeed we see that for the SmallEmpty environment, there is no significant difference because there are no obstacles present. However, as the difficulty of the obstacles increases (LargeDivider), we see that our method is much better at handling the obstacles. The difference can be seen clearly in the visualizations of example trajectories shown in Fig. 9. Our system pushes objects quite efficiently along shortest paths trajectories through free space (left). In contrast, the ablative version without shortest paths fails to learn how to avoid obstacles and continually pushes objects up against the divider wall.

We similarly run the same ablations for the steering commands baseline. Specifically, we remove the shortest path channels and shortest path partial rewards, and replace them with straight-line variants. We find that while the baseline has some ability to handle obstacles when given shortest path channels and shortest path partial rewards, the performance on obstacle environments can be dramatically worse without these shortest path components (Tab. IV).

Iv-B Real-World Experiments

We conduct experiments on the physical Anki Vector robots by replicating the simulation environment on a tabletop. Our setup can be seen in Fig. 10. We mount a camera over the tabletop, and affix fiducial markers to the robot and the objects, as well as the corners of the room. Using the markers and the overhead camera, we obtain real-time millimeter-level pose estimates of the robots and objects, which we then map into our simulation environment whenever there is any change.

In this way, we enable our simulation environment to mirror the real-world environment. This means we can test our policies, which were trained in simulation, directly in the real-world setup. Given a state representation rendered by the simulation, our trained policy outputs a high-level action, which is executed on the physical robot by a low-level controller. Overall, we find that trained agent behavior in the real-world environment is qualitatively similar to the simulation, but not quite as efficient due to differences in physical dynamics. We tested our best model on the SmallEmpty environment, and averaged across 5 test episodes, the real robot is able to push 8.4 out of 10 objects into the receptacle within 15 minutes, and all 10 within 30 minutes. We show videos of our robot executing policies (that were learned in simulation) in the real-world environment at

Looking at the videos, it is interesting to observe the emergent behaviors learned by the robot. Perhaps the most common is to first push objects up against a wall with short moves, and then later “sweep” multiple objects together along the wall with a single long trajectory ending at the receptacle (note the long paths parallel to the wall in Fig. 9). This behavior is depicted in Fig. 10, where a sequence of actions from one episode is shown in time-lapse as the robot pushes two objects at once (four near the end). Additionally, other emergent behaviors include retrying, where the the robot makes multiple attempts to nudge objects towards the receptacle after initial failure.

Fig. 10: Emergent behaviors. An Anki Vector robot executes a policy learned in simulation. It first pushes objects towards the wall, then uses the wall as a guide to simultaneously push several objects until they are all in the receptacle.

Iv-C Limitations

Limitations of our setting and experiments. Because of our sim-to-real setup, our results are limited by the accuracy of our simulations (including the measurement accuracy of physical properties in our setup), the quality and robustness of our motion primitive implementations, and our assumptions of perfect localization and mapping. The first two of these are inherent to any system that is trained in simulation, and could be compensated by fine-tuning the learned model in the physical setup. How uncertainty in state and observations should be incorporated into planning is an active research topic, particularly for deep reinforcement learning, and is orthogonal to our investigation into action space representations.

Limitations of a dense action space. One inherent limitation of our spatial action maps is that they focus only on a higher-level planning problem, and assume that lower-level control will be handled separately. Concretely, we output a destination that the agent should attempt to reach, and possibly what it should attempt to do once it gets there, and rely on a motion primitive to make that happen. (In our case, the motion primitive is hand-coded, but it could equally well be learned separately.) Other action representations, in contrast, effectively integrate learning of long-range planning and immediate control end-to-end. An interesting topic for future investigation would be whether it is better to separate or integrate these stages, or to explore a hybrid strategy combining end-to-end learning with an intermediate loss on a dense action space.

V Conclusion

In summary, we propose “spatial action maps” as an action space for mobile manipulation, and study how best to use them in our setting of training RL agents to push objects to a goal location. In our experiments, we find that an FCN can be trained to predict maps of Q-values in the spatial action map (from pixel-aligned images of input states, which encode reconstructed scene geometry and estimated shortest-path distances) more efficiently than alternative methods. We also demonstrate that our agents which were trained in simulation can transfer directly to a real-world environment without further training. Of course, this paper provides one step in a broader investigation of possible action representations in one application domain. In future work, it will be interesting to study other possible action spaces (e.g., ones that mix navigation and manipulation) and other application domains (e.g., autonomous driving).


The authors would like to thank Naveen Verma, Naomi Leonard, Anirudha Majumdar, Stefan Welker, and Yen-Chen Lin for fruitful technical discussions, as well as Julian Salazar for hardware support. This work was supported in part by the Princeton School of Engineering, as well as the National Science Foundation under IIS-1617236, IIS-1815070, and DGE-1656466.