The prolific success of deep learning in vision [13, 14, 16, 33] has inspired much recent work on using deep convolutional networks (ConvNets) for visual navigation and mobile manipulation in robotics. In such an approach, ConvNets are trained to model a policy that maps from an agent’s observation of the state (e.g., images) to the probability distribution over actions (e.g., steering commands) that maximizes expected task success.
Such policies have been trained in ways ranging from imitation learning to discovering more complex long-term path planning behaviors with reinforcement learning [7, 20, 29, 47]. Other work has considered the impact of different state representations: from front-facing camera images [4, 22, 24, 26] to top-down bird’s-eye views (BEV) of the scene generated with Inverse Perspective Mapping (IPM) [5, 7, 11, 29]. However, there has been very little work on the impact of different action representations. Almost all navigation systems based on ConvNets consider only a small set of (possibly parameterized) egocentric steering commands (e.g., move forward/backward, strafe right/left, rotate left/right, etc.) [7, 11, 12, 20, 24, 26, 29, 47].
Representing actions with steering commands presents several problems for learning complex mobile manipulation tasks (Fig. 2a). First, they are myopic, and thus each action can reach only a small subset of possible end-points (i.e., ones at a fixed distance and reachable by unobstructed straight-line paths). Second, each effects only a small change to the state, and thus a long sequence of actions is required to make a significant change (e.g., navigate through a series of obstacles). Third, they require a deep Q-network to learn a complex mapping from a high-dimensional input state representation (usually an image) to a low-dimensional set of (possibly parameterized) action classes, which may require many training examples.
In this paper, we advocate for a new action representation, “spatial action maps” (Fig. 1). The main idea is to represent actions as a dense map of navigation end-points: each action represents a move to an end-point, possibly along a non-linear trajectory, and possibly with a task to perform there (Fig. 2b-c). The advantages of this action representation are three-fold. First, the agent is not myopic – it can move to distant end-points in a single action. Second, each action can represent an arbitrarily complex navigation trajectory – ours follow shortest paths to end-points while avoiding obstacles (Fig. 2c). Third, it simplifies the mapping from states to actions – in our system, each state is represented by an image of the reconstructed scene from a bird’s-eye view (in IPM coordinates), and the action space is represented by an image encoding the expected Q-value of navigating to every end-point (also in IPM coordinates). Since the state and action space lie in the same domain and are pixel-aligned, we can train a fully convolutional network (FCN) to map between them – an efficient way to predict values for all possible actions in one forward pass of a network.
We study this action representation in the context of reinforcement learning for mobile manipulation, where an agent is tasked with exploring an unseen environment containing obstacles and objects, with the goal of navigating to push all objects into a designated target zone – i.e., pushing the objects into a receptacle like a bulldozer (Fig. 1). At every step, the only information available to the agent is a bird’s-eye view image representing a partial reconstruction of its local environment (everything it has observed with a front-facing RGB-D camera, transformed into an IPM representation and accumulated over time to simulate online SLAM/reconstruction). The agent feeds the state image into an FCN to produce an action image encoding the Q-value of moving along the estimated shortest path trajectory to every end-point location. It executes the action with the highest Q-value and iterates.
The spatial action map representation has the key benefit that the spatial position of each state-action value prediction (with respect to the input IPM view) represents a local milestone (trajectory end-point) for the agent’s control strategy. We conjecture that it is easier for convolutional networks to learn the state-action values of these navigational end-points (as opposed to abstract low-level steering commands), since each prediction is aligned and translationally anchored to the visual features of the objects and obstacles directly observed in the input map. This is motivated by gains in performance observed in other domains – for example, image segmentation has been shown to benefit from pixel-aligned input and output representations, while end-to-end robotic manipulation [21, 45, 44, 43] is significantly more sample efficient when using FCNs to predict state-action values for a dense sampling of grasps aligned with the visual input. Using spatial action maps, our experiments show that we can train end-to-end mobile manipulation policies that generalize to new environments with fewer than 60k training samples (state-action pairs). This is orders of magnitude less data than prior work [20, 47].
Our main contribution in this paper is the spatial action map representation for mobile manipulation. We not only propose the representation, but also investigate its use with a variety of action primitives (take a short step, follow the straight-line path, follow the shortest path) and state input channels (partial scene reconstructions, shortest path distances from the agent, shortest path distances to the receptacle). Via ablation studies, we find empirically that our proposed state and action representations provide better performance and learning efficiency compared to alternative representations in simulation environments, and we show that our policies also work in real environments. We provide supplemental materials, including code, simulation environments, and videos of our robot in action, at https://spatial-action-maps.cs.princeton.edu.
II. Related Work
Navigation. There has been significant recent work on training agents to navigate virtual environments using first-person visuomotor control [2, 15, 19, 27, 28, 37, 38, 39]. In a typical setup, the agent iteratively gets first-person observations as input (e.g., images with color, depth, normals, semantic segmentations, etc.), from which it builds a persistent state (e.g., a map), and selects one of many possible actions as output (e.g., move, rotate, strafe, etc.) until it completes a task (e.g., navigate to a given location, find a particular type of object, etc.). These works almost exclusively focus on how to train neural networks for the navigation tasks, for example using deep reinforcement learning [19, 47, 6, 25, 34, 12] or predicting the future. They do not study how different parameterizations of network outputs (actions) affect the learning performance – i.e., the inputs are always in one domain (e.g., images, GPS coordinates, etc.) and the outputs are in another (e.g., move forward, rotate right, interact, etc.). In contrast, we investigate the advantages of dense predictions using spatial action maps, where the inputs and outputs are represented in the same spatial domain.
Mobile Manipulation. While the majority of the navigation algorithms assume a static environment, other works also consider interaction with movable objects in the environment [18, 32, 35]. In these works, the agent navigates to the goal location by pushing aside movable obstacles that are in the way. For example, Stilman and Kuffner  propose the task of Navigation Among Movable Obstacles, where the agent navigates in an environment that contains both static structures (e.g., walls and columns) and movable obstacles. While these methods do not assume a static environment, the task is similar – the agent only needs to navigate to the goal location. In contrast, in our task, the agent must come up with a navigational plan that will push all objects in the environment to the goal location, which is a much more complex problem.
Another line of work studies navigation for object manipulation [42, 17, 10, 23]. These tasks (e.g., picking and rearrangement of objects) are similar to ours, but the robot-object interactions considered in these tasks (e.g., grasping) are often more predictable and happen over short time horizons, which makes it possible to apply simple heuristic-based algorithms to separately handle navigation and interaction. In contrast, our setup requires long time-horizon robot-object interactions, which are less predictable and more difficult to plan.
Dense Action Representations. Our work is inspired by recent work on dense affordance prediction for bin picking. For example, the multi-affordance picking strategy of Zeng et al.  selects grasps by predicting a score for every pixel of an overhead view of a bin. Similar approaches are used to predict affordances for grasping or suction in [44, 46, 41, 30]. However, these systems are trained in the more constrained setting of bin picking, where supervision is available for grasp success, and where motion trajectories to achieve selected grasps can be assumed to be viable and of equal cost. In our work, we apply dense prediction to a more challenging scenario, where different actions have different costs (and some may not even be viable), and where long-term planning is required to perform complex sequences of actions.
In this paper, we propose a new dense action representation in which each pixel of a bird’s-eye view image corresponds to the atomic action of navigating along the shortest path to the associated location in the scene.
To investigate this action representation, we consider a navigate-and-push setting in which an agent must explore an unseen environment, locate objects, and push them into a designated target “receptacle” (Fig. 3). This task may be seen as an abstraction of real-world tasks such as bulldozing debris or sweeping trash, and is sufficiently complex that it would be difficult to implement an effective hand-coded policy or learn a policy with traditional action representations.
Our agent is an Anki Vector, a low-cost mobile tracked robot approximately 10 cm in length, augmented with a 3D-printed bulldozer-like end-effector (Fig. 1). The objects to be pushed are 4.4 cm cubic blocks, and the receptacle is located in the top right corner of a 1 m by 0.5 m “room” with random partitions and obstacles. For ease of prototyping, we include fiducial markers on the robot and the objects, and track them with an overhead camera. Our setting, though, is intended to represent what would be possible with on-board sensing and SLAM. Therefore, the only information made available to the agent is a simulation of what would be observed via an onboard forward-facing RGB-D camera with a 90° field of view, integrated over time with online mapping – i.e., our agents do not have access to ground truth global state.
Our agents are trained in a PyBullet simulation, where state observations are generated by synthetically rendering camera views of the environment. We then execute learned policies in the real world, where the fiducial markers are used to update the state of the simulator (e.g., robot and object poses). Simulation enables us to train our agents over a wider range of environments than would be possible with physical robots, while the sim-to-real mirroring enables us to evaluate the robustness and generalization of our policies to unseen dynamics in the real world (e.g., mass and friction).
In the following subsections, we provide details about our training setup and describe how the appropriate action representation proves crucial to reducing the training time and increasing the generalizability of our policies.
III-A. Reinforcement Learning (DQN) Formulation
We formulate the navigate-and-push problem as a Markov decision process: given state $s_t$ at time $t$, the agent chooses to execute an action $a_t$ according to a policy $\pi(s_t)$, then transitions to a new state $s_{t+1}$ and receives a reward $r_t$. The goal of reinforcement learning is to find an optimal policy $\pi^*$ that selects actions maximizing total expected rewards $\mathbb{E}\left[\sum_{i=t}^{\infty}\gamma^{i-t} r_i\right]$, i.e., a $\gamma$-discounted sum over an infinite horizon of future returns from time $t$ to $\infty$.
In this work, we investigate off-policy Q-learning to train a greedy policy $\pi(s_t) = \operatorname{argmax}_{a} Q_\theta(s_t, a)$ that chooses actions by maximizing a parameterized Q-function $Q_\theta(s_t, a_t)$ (i.e., state-action value function), where $\theta$ denotes the weights of our neural network (whose architecture we describe in Sec. III-D). We train our agents using the double DQN learning objective. Formally, at each training iteration $i$, our objective is to minimize:
$$\mathcal{L}_i = \left|\,r_t + \gamma\, Q_{\theta_i^-}\!\left(s_{t+1}, \operatorname{argmax}_{a'} Q_{\theta_i}(s_{t+1}, a')\right) - Q_{\theta_i}(s_t, a_t)\,\right|$$
where $(s_t, a_t, r_t, s_{t+1})$ is a transition uniformly sampled from the replay buffer, and the target network parameters $\theta_i^-$ are held fixed between individual updates. More training details are presented in Sec. III-D.
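As a concrete illustration, the double DQN target can be sketched as follows (a minimal NumPy sketch; function and variable names are ours, and in our system the per-action Q-values correspond to the pixels of the FCN output):

```python
import numpy as np

def double_dqn_targets(rewards, next_q_online, next_q_target, gamma=0.9):
    """Compute double DQN targets y = r + gamma * Q_target(s', argmax_a' Q_online(s', a')).

    next_q_online / next_q_target: (batch, num_actions) Q-values for s' from
    the online network and the periodically frozen target network.
    """
    # Action selection uses the online network ...
    best_actions = np.argmax(next_q_online, axis=1)
    # ... but action evaluation uses the target network, which reduces
    # the overestimation bias of vanilla DQN.
    evaluated = next_q_target[np.arange(len(rewards)), best_actions]
    return rewards + gamma * evaluated
```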
III-B. State Representation
Within our formulation, we represent the agent’s observation of the state as a visual 4-channel image from a local bird’s-eye view that is spatially aligned and oriented with respect to the robot’s coordinate frame (such that the robot is positioned at the center of each image with a fixed heading direction). This is similar to inverse perspective mapping (IPM), commonly used in autonomous driving [5, 7, 11, 29]. Each channel encodes useful information related to the environment (visualized in Fig. 4): (1) a local observation of the robot’s surroundings in the form of an overhead map, (2) a binary mask with a circle whose diameter and position encode the robot’s size and location in image coordinates, (3) an image where each pixel holds the shortest path distance from the agent to the corresponding location, and (4) an image where each pixel holds the shortest path distance to the receptacle from that location. The shortest path distances in the third and fourth channels are computed using an occupancy map generated only from local observations of obstacles (which are accumulated over time – see next paragraph) and normalized so that they contain relative values rather than absolute ones. All unobserved regions are treated as free space when computing shortest path distances. This reflects a realistic setting in which a robot has access to nothing but its own visual observations, GPS coordinates, local mapping, and task-related goal coordinates.
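The shortest path distance channels can be computed with a breadth-first search over the occupancy map. The sketch below is illustrative: it uses 4-connected grid-step BFS and max-normalization as one plausible way of producing relative values; the exact connectivity and normalization are implementation choices, not specified above.

```python
from collections import deque
import numpy as np

def shortest_path_distance_map(occupancy, start):
    """BFS distance (in grid steps) from `start` to every free cell.

    occupancy: 2D bool array, True = obstacle. Unobserved cells should
    already be marked free, matching the assumption in the text.
    Unreachable cells are left at infinity.
    """
    h, w = occupancy.shape
    dist = np.full((h, w), np.inf)
    dist[start] = 0.0
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not occupancy[nr, nc] \
                    and dist[nr, nc] == np.inf:
                dist[nr, nc] = dist[r, c] + 1
                queue.append((nr, nc))
    # Normalize to relative values, as described for the state channels
    # (dividing by the max finite distance is one possible choice).
    finite = np.isfinite(dist)
    if finite.any() and dist[finite].max() > 0:
        dist = dist / dist[finite].max()
    return dist
```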
Online mapping. Online SLAM/reconstruction is a common component of any real mobile robot system. In our experiments, this is implemented in the simulation using images from a forward-facing RGB-D camera mounted on the robot. The camera captures a new image at the end of every transition. Using camera intrinsics and robot poses, depth data is projected into a 3D point cloud, then fused with previous views to generate and update a global map of the environment. At the beginning of each episode in a new environment, this global map is initially blank. As the robot moves around in the environment, the global map fills in as it accumulates more partial views over time. This restriction encourages the agent to learn a policy that can explore unseen areas in order to effectively complete the task. Specifically, the robot gradually reconstructs two global maps as it moves around: (1) a global overhead state map, which encodes objects and obstacles, and (2) a global occupancy map, used for shortest path computations in our high-level motion primitives as well as our state representations and partial rewards (see Sec. III-D). The agent makes no prior assumptions about the layout of obstacles in the environment.
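As a simplified stand-in for the depth projection and fusion step, the sketch below fuses a single forward-facing range scan into a global occupancy grid; a real implementation would back-project the full RGB-D image using camera intrinsics as described above. All names, the cell size, and the scan abstraction are illustrative.

```python
import numpy as np

def fuse_scan_into_map(ranges, angles, pose, global_map, cell=0.01):
    """Fuse one forward-facing range scan (a 2D stand-in for the RGB-D
    projection described in the text) into a global occupancy grid.

    ranges/angles: per-ray hit distance (m) and bearing relative to the
    robot's heading; pose: (x, y, yaw) of the robot in world coordinates.
    """
    px, py, yaw = pose
    # Transform ray endpoints from the robot frame into world coordinates.
    wx = px + ranges * np.cos(yaw + angles)
    wy = py + ranges * np.sin(yaw + angles)
    gi = (wy / cell).astype(int)
    gj = (wx / cell).astype(int)
    inside = (gi >= 0) & (gi < global_map.shape[0]) & \
             (gj >= 0) & (gj < global_map.shape[1])
    # Mark the hit cells as occupied; the map fills in as views accumulate.
    global_map[gi[inside], gj[inside]] = 1
    return global_map
```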
Fig. 6: Environments. (a) SmallEmpty, (b) SmallColumns, (c) LargeColumns, (d) LargeDivider.
III-C. Action Representation
Our actions are represented by an image (i.e., action map, illustrated in Fig. 2) identically sized and spatially aligned with the input state representation. Each pixel in this action map corresponds to a navigational end-point in local robot Cartesian coordinates, aligned to the scale of the visual observations in the state representation. At each time step, the agent selects a pixel location in the action map – and therefore in the observed environment – indicating where it intends to move. Specifically, the selected location in the image indicates where the robot’s front-facing end effector should be located after the action has been completed. The agent then uses a movement primitive to execute the move.
We experiment with two types of movement primitives: one that moves in a straight line towards the selected location, and one that follows the shortest path to the selected location. The straight line primitive simply turns the robot to face the selected location, and then moves forward until it has reached the location. The shortest path primitive uses the global occupancy map introduced in Sec. III-B to compute and follow the shortest path to the desired target location (Fig. 5). For both, it is possible for the robot to collide with previously unseen obstacles, in which case a reward penalty is incurred to the agent. Our experiments in Sec. IV-A compare this representation to discrete action alternatives (e.g., steering commands) commonly used in the literature. Note that in empty environments without obstacles, these two movement primitives are equivalent.
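Under this representation, action selection reduces to an argmax over the Q-value image, followed by converting the chosen pixel into a target location in the robot's local frame. A minimal sketch (the meters-per-pixel scale and the forward/right sign conventions are placeholders, not our calibrated values):

```python
import numpy as np

def select_action(q_map, meters_per_pixel=0.01):
    """Pick the end-point with the highest predicted Q-value.

    q_map: (H, W) Q-values, pixel-aligned with the bird's-eye state image;
    the robot sits at the image center. Returns the chosen pixel and its
    offset in the robot's local frame.
    """
    flat_idx = np.argmax(q_map)
    row, col = np.unravel_index(flat_idx, q_map.shape)
    # Convert image coordinates to an offset relative to the robot (center);
    # here "up" in the image is taken as the robot's forward direction.
    dy = (q_map.shape[0] / 2 - row) * meters_per_pixel   # forward
    dx = (col - q_map.shape[1] / 2) * meters_per_pixel   # right
    return (row, col), (dx, dy)
```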
III-D. Network Architecture and Training Details
We model our parameterized Q-function with a fully convolutional neural network (FCN), using a ResNet-18 backbone for all experiments. The FCN takes as input the 4-channel image state representation (described in Sec. III-B) and outputs a state-action value map (described in Sec. III-C). We removed the AvgPool and fully connected layers at the end of the ResNet-18, and replaced them with 3 convolutional layers interleaved with bilinear upsampling layers. The added convolutional layers use 1×1 kernels, and the upsampling layers use a scale factor of 2. This gives us an output that is the same size as the input. We also removed all BatchNorm layers from our networks, which provided more training stability with small batch sizes. To ensure that the FCN has an adequate receptive field, we designed our observation crop size (96 × 96) such that the receptive field of each network output covers over a quarter of the input image area, and thus always covers the center of the image, where the robot is located.
Rewards. Our reward structure for reinforcement learning (computed after each transition) consists of three components: (1) a positive reward of +1 for each new object that moves into the designated receptacle (objects in the receptacle are removed thereafter), (2) partial rewards and penalties based on whether each object was moved closer to or further away from the receptacle (partial rewards are proportional to the change in distance computed using either Euclidean distances or shortest path distances – comparison in Sec. IV-A), and (3) small penalties for undesirable behaviors such as collision with obstacles, or for standing still.
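The reward computation above can be sketched as follows. All coefficients are placeholders; the text specifies only the +1 success reward and the sign conventions of the partial rewards and penalties.

```python
def step_reward(num_newly_deposited, dist_deltas, collided, moved,
                success_reward=1.0, partial_scale=0.25,
                collision_penalty=0.25, nomove_penalty=0.05):
    """Illustrative reward: +1 per newly deposited object, partial rewards
    proportional to each object's change in (shortest path) distance to
    the receptacle, and small penalties for collisions or standing still.

    dist_deltas: per-object (new_distance - old_distance), so moving an
    object closer to the receptacle yields a negative delta.
    """
    r = success_reward * num_newly_deposited
    r += partial_scale * sum(-d for d in dist_deltas)  # closer => positive
    if collided:
        r -= collision_penalty
    if not moved:
        r -= nomove_penalty
    return r
```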
Training details. During training, our agent interacts with the environment and stores data from each transition in an experience replay buffer of size 10,000. At each time step, we uniformly at random sample a batch of experiences from the experience replay buffer, and train our neural network with smooth L1 loss (i.e., Huber loss). We pass gradients only through the single pixel of the network output that corresponds to the selected action for a transition. We clip gradients such that they have magnitude at most 10. We train with batch size 32 and use stochastic gradient descent (SGD) with learning rate 0.01, momentum 0.9, and weight decay 0.0001. To account for varying distances traveled for different steps, we apply a discount factor $\gamma$ with an exponent that is proportional to the distance traveled during that step.
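The per-transition loss, including the single-pixel gradient masking and the distance-dependent discount exponent, can be sketched as follows (names and the unit step distance are illustrative):

```python
import numpy as np

def huber(x, delta=1.0):
    """Smooth L1 (Huber) loss."""
    return np.where(np.abs(x) < delta, 0.5 * x**2, delta * (np.abs(x) - 0.5 * delta))

def td_loss_at_pixel(q_map, action_pixel, reward, next_q_max, gamma,
                     dist_traveled, unit_step=0.25):
    """Loss for the single output pixel corresponding to the executed
    action; all other pixels receive no gradient. The discount exponent
    grows with the distance traveled during the step.
    """
    effective_gamma = gamma ** (dist_traveled / unit_step)
    target = reward + effective_gamma * next_q_max
    return huber(target - q_map[action_pixel])
```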
In our experiments, we train for 60,000 transitions, which typically corresponds to several hundred episodes. Each episode runs until either all objects in the environment have been pushed into the target receptacle, or the robot has not pushed an object into the receptacle for 100 steps. Our policies are trained from scratch, without any image-based pre-training. The target network is updated every 1,000 steps. Training takes about 9 hours on a single NVIDIA Titan Xp GPU.
Exploration. Before training a network, we run a random policy for 1,000 steps to fill the replay buffer with some initial experience. Our exploration strategy is $\epsilon$-greedy, with $\epsilon$ annealed from its initial value to 0.01 after 6,000 transitions.
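A linear annealing schedule consistent with this description might look like the following (the initial value of 1.0 and the linear shape are assumptions; the text fixes only the final value and the 6,000-transition horizon):

```python
def epsilon_at(step, eps_start=1.0, eps_end=0.01, anneal_steps=6000):
    """Linearly annealed epsilon for epsilon-greedy exploration.

    Returns eps_start at step 0, decaying linearly to eps_end at
    anneal_steps and staying there afterwards.
    """
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```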
To test the proposed ideas, we run a series of experiments in both simulation and real-world environments. We first describe the simulation experiments, which are used to investigate trade-offs of different algorithms, and then we test our best algorithm on the physical robot.
Task. In every experiment, the robot is initialized with a random pose within a 3D environment enclosed by walls. Within the environment, there is a set of cubic objects scattered randomly throughout free space and a 15 cm by 15 cm square receptacle in the top right corner, which serves as the target destination for the objects. The robot’s task is to execute a sequence of actions that causes all of the objects to be pushed into the receptacle. Objects are removed from the environment after they are pushed into the receptacle.
Environments. We ran experiments with four virtual environments of increasing difficulty (Fig. 6). The first (SmallEmpty) is a small rectangular environment (1 m by 0.5 m) containing 10 randomly placed objects. The second (SmallColumns) is the same, but also contains a random number (1 to 3) of square (10 cm by 10 cm) fixed obstacles (like “columns”) placed randomly. The third (LargeColumns) is a similar but larger (1 m by 1 m) environment, with more columns (1 to 8), and more objects (20). The fourth (LargeDivider) is the same as the third, but replaces the columns with a single large divider that spans 80% of the dimension at a randomly chosen coordinate – this last environment requires the robot to plan how to get from one side of the divider to the other by going through the narrow open gap, and thus is the most difficult.
| Environment  | Ours         | No shortest path movement | Fixed step size | Steering commands |
|--------------|--------------|---------------------------|-----------------|-------------------|
| SmallEmpty   | 9.91 ± 0.11  | n/a                       | 9.75 ± 0.20     | 1.38 ± 0.20       |
| SmallColumns | 9.18 ± 0.14  | 7.88 ± 0.70               | 9.05 ± 0.38     | 0.82 ± 0.33       |
| LargeColumns | 18.29 ± 0.45 | 14.70 ± 1.52              | 17.52 ± 0.82    | 1.20 ± 0.64       |
| LargeDivider | 18.23 ± 0.92 | 15.66 ± 1.44              | 15.56 ± 2.01    | 4.14 ± 2.21       |

(Number of objects pushed into the receptacle per episode)
Evaluation metrics. We evaluate each model by running the trained agent for 20 episodes in the environment it was trained on. Since the environments are randomly generated every episode, for each model, we set the random seed so that the exact same set of generated environments are presented to each model, including the initial robot pose, object poses, and obstacle placements. For all experiments, we do 5 training runs with the same setup and report the mean and standard deviation across the 5 runs.
We use two evaluation metrics. The first simply measures the number of objects that have been pushed into the receptacle at the end of an episode (Tab. I). The second plots the number of objects successfully pushed into the receptacle as a function of how far the robot has moved (Fig. 7). Additionally, we compare sample efficiency of the training by plotting objects per episode on the training environments, as a function of training steps (Fig. 8). Higher numbers are better.
IV-A. Simulation Experiments
Comparison to baseline. Our first set of experiments is designed to evaluate how spatial action maps compare to more traditional action formulations. To investigate this question, we ran experiments with an 18-dimensional discrete action space consisting of steering commands discretized into 18 turn angles. We created a modified (baseline) version of our system by replacing the last layer of our network with a fully connected layer that outputs the predicted Q-value for rotating by each of the 18 possible angles and then taking a step forward by 25 cm. We also added two additional channels to the state representation to encode the relative position of every pixel location. This modified network mimics the DQN architectures and actions typical of other navigation algorithms [2, 15, 19, 27, 28, 37, 38, 39], and yet is the same as ours in all other aspects.
Results are shown in Tab. I. We find that the steering commands baseline algorithm (right) is unable to learn effectively in any of the four environments. We conjecture the reasons are two-fold. First, the baseline network must learn a mapping from observations to action classes, which may be harder than the dense prediction enabled by spatial action maps. Second, the baseline agent can only reap rewards by executing a long sequence of short steps, and so it is difficult for the algorithm to achieve any reward at all in the early phases of training (Fig. 8). In contrast, our method utilizes more complex actions (that go all the way to the selected location along the shortest path), and thus can discover rewards with fewer actions. This results in a policy that performs the task more completely with less motion (Fig. 7).
Effect of actions based on shortest path trajectories. To test the hypothesis that shortest path trajectory actions help our algorithm learn more efficiently, we ran a second experiment using our full system with a small modification: the action primitive requires the robot to move in a straight line to the selected target location (rather than along the shortest path). The results of this experiment (“No shortest path movement” in Tab. I, and green curve in Fig. 7) show a degradation in performance compared to ours. Even though shortest paths are computed based on partial maps reconstructed from past observations (and thus may be wrong), using them as our movement primitive seems to help the rate of learning.
Effect of actions with fixed step size. To test the hypothesis that action primitives with longer trajectories help our algorithm learn more efficiently, we ran a third experiment using our system with a different small modification: the length of any trajectory is limited to a small step (25 cm). That is, the same dense network predicts full Q-values for a dense space of actions as usual, but the agent can only step a fixed distance in the direction of the position with the highest Q-value at each iteration. The result of this variant (“Fixed step size” in Tab. I and red curve in Fig. 7) shows that taking shorter steps indeed degrades performance. Even though it would be possible for the agent to take many small steps to achieve the same trajectory as a single long one, we find that it learns more quickly with longer trajectories. We conjecture that this could be due to inconsistencies in Q-values predicted from different perspectives, as well as a less direct mapping from visual features to dense actions, causing the agent to waver between different endpoint targets as it makes many small steps.
Effect of shortest path input channels. In addition to an overhead image, our system provides the agent with three additional input image channels: (1) an image with a dot at the robot position, (2) an image with shortest path distances from the agent’s position, and (3) an image with shortest path distances to the receptacle. To test whether the latter two of these channels are helpful, we ran ablation studies without each of them. The results (Tab. IV) show that both channels provide little benefit in the smaller and less cluttered environments, but help the system train more effectively in the most challenging environment (LargeDivider). Giving the agent input channels with estimated shortest paths seems to help it to choose target locations with objects that can be reached more easily (without going around large obstacles).
Effect of shortest path partial rewards. We train our agents with partial rewards given for pushing objects closer to or further away from the receptacle. These partial rewards are proportional to each object’s change in distance from the receptacle during a step. We hypothesize that in environments with obstacles, it is important to give partial rewards based on changes in shortest path distance rather than Euclidean distance from the receptacle. We verify by running ablations that use Euclidean distance rather than shortest path distance, shown in Tab. IV. We observe that agents trained with Euclidean distance partial rewards indeed perform worse, particularly for the LargeDivider environment, where true shortest path distances to the receptacle can be much larger than Euclidean distances. We believe that giving partial rewards based on shortest path distances provides a better training signal to the agent. While Euclidean distances indicate whether an object is close to the receptacle, shortest path distances additionally factor in the need to push the object around the obstacles to get to the target receptacle.
Fig. 9: Example trajectories. (a) Ours, (b) Ours w/o shortest path components, (c) Steering commands.
| Environment  | Ours         | No shortest path from agent | No shortest path to receptacle |
|--------------|--------------|-----------------------------|--------------------------------|
| SmallEmpty   | 9.91 ± 0.11  | 9.82 ± 0.21                 | n/a                            |
| SmallColumns | 9.18 ± 0.14  | 9.18 ± 0.32                 | 9.20 ± 0.22                    |
| LargeColumns | 18.29 ± 0.45 | 18.40 ± 0.88                | 18.88 ± 0.49                   |
| LargeDivider | 18.23 ± 0.92 | 16.87 ± 1.97                | 16.71 ± 1.49                   |

(Ablation of shortest path input channels: number of objects pushed into the receptacle per episode)
| Environment  | Ours         | Ours, no shortest path | Steering commands | Steering commands, no shortest path |
|--------------|--------------|------------------------|-------------------|-------------------------------------|
| SmallEmpty   | 9.91 ± 0.11  | 9.82 ± 0.10            | 1.38 ± 0.20       | 1.23 ± 0.69                         |
| SmallColumns | 9.18 ± 0.14  | 8.05 ± 0.29            | 0.82 ± 0.33       | 0.72 ± 0.44                         |
| LargeColumns | 18.29 ± 0.45 | 15.63 ± 1.17           | 1.20 ± 0.64       | 0.33 ± 0.20                         |
| LargeDivider | 18.23 ± 0.92 | 10.06 ± 0.89           | 4.14 ± 2.21       | 0.26 ± 0.12                         |

(Ablation removing all shortest path components: number of objects pushed into the receptacle per episode)
| Environment  | Ours         | No shortest path in partial rewards |
|--------------|--------------|-------------------------------------|
| SmallColumns | 9.18 ± 0.14  | 9.13 ± 0.28                         |
| LargeColumns | 18.29 ± 0.45 | 17.89 ± 0.97                        |
| LargeDivider | 18.23 ± 0.92 | 16.87 ± 1.97                        |

(Ablation of shortest path partial rewards: number of objects pushed into the receptacle per episode)
Effect of removing all shortest path computations. Here we remove all shortest path components of our system, specifically the (1) shortest path movement primitive, (2) shortest path channels, and (3) shortest path partial rewards. We replace them with their straight-line variants: (1) straight-line movement, (2) a channel containing Euclidean distance to the receptacle, and (3) Euclidean distance partial rewards. The results are shown in Tab. IV. We see that for the SmallEmpty environment, there is no significant difference because there are no obstacles present. However, as the difficulty of the obstacles increases (LargeDivider), we see that our method is much better at handling the obstacles. The difference can be seen clearly in the visualizations of example trajectories shown in Fig. 9. Our system pushes objects quite efficiently along shortest path trajectories through free space (left). In contrast, the ablated version without shortest paths fails to learn how to avoid obstacles and continually pushes objects up against the divider wall.
We similarly run the same ablations for the steering commands baseline. Specifically, we remove the shortest path channels and shortest path partial rewards, and replace them with straight-line variants. We find that while the baseline has some ability to handle obstacles when given shortest path channels and shortest path partial rewards, the performance on obstacle environments can be dramatically worse without these shortest path components (Tab. IV).
IV-B. Real-World Experiments
We conduct experiments on the physical Anki Vector robots by replicating the simulation environment on a tabletop. Our setup can be seen in Fig. 10. We mount a camera over the tabletop, and affix fiducial markers to the robot and the objects, as well as the corners of the room. Using the markers and the overhead camera, we obtain real-time millimeter-level pose estimates of the robots and objects, which we then map into our simulation environment whenever there is any change.
In this way, we enable our simulation environment to mirror the real-world environment. This means we can test our policies, which were trained in simulation, directly in the real-world setup. Given a state representation rendered by the simulation, our trained policy outputs a high-level action, which is executed on the physical robot by a low-level controller. Overall, we find that trained agent behavior in the real-world environment is qualitatively similar to the simulation, but not quite as efficient due to differences in physical dynamics. We tested our best model on the SmallEmpty environment, and averaged across 5 test episodes, the real robot is able to push 8.4 out of 10 objects into the receptacle within 15 minutes, and all 10 within 30 minutes. We show videos of our robot executing policies (that were learned in simulation) in the real-world environment at https://spatial-action-maps.cs.princeton.edu.
Looking at the videos, it is interesting to observe the emergent behaviors learned by the robot. Perhaps the most common is to first push objects up against a wall with short moves, and then later “sweep” multiple objects together along the wall with a single long trajectory ending at the receptacle (note the long paths parallel to the wall in Fig. 9). This behavior is depicted in Fig. 10, where a sequence of actions from one episode is shown in time-lapse as the robot pushes two objects at once (four near the end). Other emergent behaviors include retrying, where the robot makes multiple attempts to nudge objects toward the receptacle after an initial failure.
Limitations of our setting and experiments. Because of our sim-to-real setup, our results are limited by the accuracy of our simulations (including the measurement accuracy of physical properties in our setup), the quality and robustness of our motion primitive implementations, and our assumptions of perfect localization and mapping. The first two of these are inherent to any system trained in simulation, and could be compensated for by fine-tuning the learned model in the physical setup. How uncertainty in state and observations should be incorporated into planning is an active research topic, particularly for deep reinforcement learning, and is orthogonal to our investigation into action space representations.
Limitations of a dense action space. One inherent limitation of our spatial action maps is that they focus only on a higher-level planning problem, and assume that lower-level control will be handled separately. Concretely, we output a destination that the agent should attempt to reach, and possibly what it should attempt to do once it gets there, and rely on a motion primitive to make that happen. (In our case, the motion primitive is hand-coded, but it could equally well be learned separately.) Other action representations, in contrast, effectively integrate learning of long-range planning and immediate control end-to-end. An interesting topic for future investigation would be whether it is better to separate or integrate these stages, or to explore a hybrid strategy combining end-to-end learning with an intermediate loss on a dense action space.
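For instance, this separation means the motion primitive can be as simple as a hand-tuned turn-then-drive controller toward the next shortest-path waypoint. The following is a hypothetical sketch with assumed gains `k_ang` and `v_max`, not the paper's exact controller:

```python
import math

def waypoint_controller(pose, waypoint, k_ang=2.0, v_max=0.1):
    """Differential-drive primitive: turn toward the waypoint, drive when aligned.
    pose is (x, y, theta); returns (linear, angular) velocity commands.
    The gains are illustrative placeholders."""
    x, y, theta = pose
    heading = math.atan2(waypoint[1] - y, waypoint[0] - x)
    err = (heading - theta + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi]
    angular = k_ang * err
    linear = v_max * max(0.0, math.cos(err))  # creep forward only when roughly aligned
    return linear, angular
```

Whether such a fixed primitive should instead be learned jointly with the planner is exactly the open question raised above.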
In summary, we propose “spatial action maps” as an action space for mobile manipulation, and study how best to use them in our setting of training RL agents to push objects to a goal location. In our experiments, we find that an FCN can be trained to predict maps of Q-values over the spatial action map (from pixel-aligned images of input states, which encode reconstructed scene geometry and estimated shortest-path distances) more efficiently than alternative methods. We also demonstrate that agents trained in simulation can transfer directly to a real-world environment without further training. Of course, this paper provides one step in a broader investigation of possible action representations in one application domain. In future work, it will be interesting to study other possible action spaces (e.g., ones that mix navigation and manipulation) and other application domains (e.g., autonomous driving).
The authors would like to thank Naveen Verma, Naomi Leonard, Anirudha Majumdar, Stefan Welker, and Yen-Chen Lin for fruitful technical discussions, as well as Julian Salazar for hardware support. This work was supported in part by the Princeton School of Engineering, as well as the National Science Foundation under IIS-1617236, IIS-1815070, and DGE-1656466.
- Amato et al.  Christopher Amato, George D Konidaris, and Leslie P. Kaelbling. Planning with macro-actions in decentralized pomdps. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2014.
- Anderson et al.  Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Bhatti et al.  Shehroze Bhatti, Alban Desmaison, Ondrej Miksik, Nantas Nardelli, N Siddharth, and Philip HS Torr. Playing doom with slam-augmented deep reinforcement learning. arXiv preprint arXiv:1612.00380, 2016.
- Bojarski et al.  Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
- Bruls et al.  Tom Bruls, Horia Porav, Lars Kunze, and Paul Newman. The right (angled) perspective: Improving the understanding of road scenes using boosted inverse perspective mapping. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 302–309. IEEE, 2019.
- Chen et al.  Tao Chen, Saurabh Gupta, and Abhinav Gupta. Learning exploration policies for navigation. In International Conference on Learning Representations, 2019.
- Chen et al.  Xi Chen, Ali Ghadirzadeh, John Folkesson, Mårten Björkman, and Patric Jensfelt. Deep reinforcement learning to acquire navigation skills for wheel-legged robots in complex environments. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3110–3116. IEEE, 2018.
- Coumans and Bai  Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. GitHub repository, 2016.
- Dosovitskiy and Koltun  Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. arXiv preprint arXiv:1611.01779, 2016.
- Fuchikawa et al.  Yasuhiro Fuchikawa, Takeshi Nishida, Shuichi Kurogi, Takashi Kondo, Fujio Ohkawa, Toshinori Suehiro, Yasuhiro Watanabe, Yoshinori Kawamura, Masayuki Obata, Hidekazu Miyagawa, et al. Development of a vision system for an outdoor service robot to collect trash on streets. In Computer Graphics and Imaging, pages 100–105. Citeseer, 2005.
- Gao et al.  Wei Gao, David Hsu, Wee Sun Lee, Shengmei Shen, and Karthikk Subramanian. Intention-net: Integrating planning and deep learning for goal-directed autonomous navigation. arXiv preprint arXiv:1710.05627, 2017.
- Gupta et al.  Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2017.
- He et al.  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- He et al.  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- Kolve et al.  Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. Ai2-thor: An interactive 3d environment for visual ai. arXiv preprint arXiv:1712.05474v3, 2019.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Lehner et al.  Peter Lehner, Sebastian Brunner, Andreas Dömel, Heinrich Gmeiner, Sebastian Riedel, Bernhard Vodermayer, and Armin Wedler. Mobile manipulation for planetary exploration. In 2018 IEEE Aerospace Conference, pages 1–11. IEEE, 2018.
- Levihn et al.  Martin Levihn, Jonathan Scholz, and Mike Stilman. Hierarchical decision theoretic planning for navigation among movable obstacles. In Algorithmic Foundations of Robotics X, pages 19–35. Springer, 2013.
- Lowe et al.  Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in neural information processing systems, pages 6379–6390, 2017.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Morrison et al.  Douglas Morrison, Peter Corke, and Jürgen Leitner. Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach. arXiv preprint arXiv:1804.05172, 2018.
- Muller et al.  Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, and Yann L Cun. Off-road obstacle avoidance through end-to-end learning. In Advances in neural information processing systems, pages 739–746, 2006.
- Nishida et al.  Takeshi Nishida, Yuji Takemura, Yasuhiro Fuchikawa, Shuichi Kurogi, Shuji Ito, Masayuki Obata, Norio Hiratsuka, Hidekazu Miyagawa, Yasuhiro Watanabe, Fumitaka Koga, et al. Development of outdoor service robots. In 2006 SICE-ICASE International Joint Conference, pages 2052–2057. IEEE, 2006.
- Pfeiffer et al.  Mark Pfeiffer, Michael Schaeuble, Juan Nieto, Roland Siegwart, and Cesar Cadena. From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots. In 2017 ieee international conference on robotics and automation (icra), pages 1527–1533. IEEE, 2017.
- Qi et al.  William Qi, Ravi Teja Mullapudi, Saurabh Gupta, and Deva Ramanan. Learning to move with affordance maps. In International Conference on Learning Representations, 2020.
- Ross et al.  Stéphane Ross, Narek Melik-Barkhudarov, Kumar Shaurya Shankar, Andreas Wendel, Debadeepta Dey, J Andrew Bagnell, and Martial Hebert. Learning monocular reactive uav control in cluttered natural environments. In 2013 IEEE international conference on robotics and automation, pages 1765–1772. IEEE, 2013.
- Savva et al.  Manolis Savva, Angel X Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun. Minos: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931, 2017.
- Savva et al.  Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE International Conference on Computer Vision, pages 9339–9347, 2019.
- Shah et al.  Pararth Shah, Marek Fiser, Aleksandra Faust, J Chase Kew, and Dilek Hakkani-Tur. Follownet: Robot navigation by following natural language directions with deep reinforcement learning. arXiv preprint arXiv:1805.06150, 2018.
- Song et al.  Shuran Song, Andy Zeng, Johnny Lee, and Thomas Funkhouser. Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. arXiv preprint arXiv:1912.04344, 2020.
- Stilman and Kuffner  Mike Stilman and James J Kuffner. Navigation among movable obstacles: Real-time reasoning in complex environments. International Journal of Humanoid Robotics, 2(04):479–503, 2005.
- Stilman et al.  Mike Stilman, Jan-Ullrich Schamburek, James Kuffner, and Tamim Asfour. Manipulation planning among movable obstacles. In Proceedings 2007 IEEE international conference on robotics and automation, pages 3327–3332. IEEE, 2007.
- Szegedy et al.  Christian Szegedy, Alexander Toshev, and Dumitru Erhan. Deep neural networks for object detection. In Advances in neural information processing systems, pages 2553–2561, 2013.
- Tamar et al.  Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.
- Van Den Berg et al.  Jur Van Den Berg, Mike Stilman, James Kuffner, Ming Lin, and Dinesh Manocha. Path planning among movable obstacles: a probabilistically complete approach. In Algorithmic Foundation of Robotics VIII, pages 599–614. Springer, 2009.
- Van Hasselt et al.  Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- Wu et al.  Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209, 2018.
- Xia et al.  Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9068–9079, 2018.
- Yan et al.  Claudia Yan, Dipendra Misra, Andrew Bennett, Aaron Walsman, Yonatan Bisk, and Yoav Artzi. Chalet: Cornell house agent learning environment. arXiv preprint arXiv:1801.07357, 2018.
- Yen-Chen et al.  Lin Yen-Chen, Andy Zeng, Shuran Song, Phillip Isola, and Tsung-Yi Lin. Learning to see before learning to act: Visual pre-training for manipulation. IEEE International Conference on Robotics and Automation (ICRA), 2020.
- Zakka et al.  Kevin Zakka, Andy Zeng, Johnny Lee, and Shuran Song. Form2fit: Learning shape priors for generalizable assembly from disassembly. arXiv preprint arXiv:1910.13675, 2019.
- Zapata-Impata et al.  Brayan S Zapata-Impata, Vikrant Shah, Hanumant Singh, and Robert Platt. Autotrans: an autonomous open world transportation system. arXiv preprint arXiv:1810.03400, 2018.
- Zeng  Andy Zeng. Learning Visual Affordances for Robotic Manipulation. PhD thesis, Princeton University, 2019.
- Zeng et al. [2018a] Andy Zeng, Shuran Song, Stefan Welker, Johnny Lee, Alberto Rodriguez, and Thomas Funkhouser. Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4238–4245. IEEE, 2018a.
- Zeng et al. [2018b] Andy Zeng, Shuran Song, Kuan-Ting Yu, Elliott Donlon, Francois R Hogan, Maria Bauza, Daolin Ma, Orion Taylor, Melody Liu, Eudald Romo, et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018b.
- Zeng et al.  Andy Zeng, Shuran Song, Johnny Lee, Alberto Rodriguez, and Thomas Funkhouser. Tossingbot: Learning to throw arbitrary objects with residual physics. arXiv preprint arXiv:1903.11239, 2019.
- Zhu et al.  Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pages 3357–3364. IEEE, 2017.