Know Thyself: Transferable Visuomotor Control Through Robot-Awareness

by Edward S. Hu, et al.
University of Pennsylvania

Training visuomotor robot controllers from scratch on a new robot typically requires generating large amounts of robot-specific data. Could we leverage data previously collected on another robot to reduce or even completely remove this need for robot-specific data? We propose a "robot-aware" solution paradigm that exploits readily available robot "self-knowledge" such as proprioception, kinematics, and camera calibration to achieve this. First, we learn modular dynamics models that pair a transferable, robot-agnostic world dynamics module with a robot-specific, analytical robot dynamics module. Next, we set up visual planning costs that draw a distinction between the robot self and the world. Our experiments on tabletop manipulation tasks in simulation and on real robots demonstrate that these plug-in improvements dramatically boost the transferability of visuomotor controllers, even permitting zero-shot transfer onto new robots for the very first time.




1 Introduction

Raw visual observations provide a versatile, high-bandwidth, and low-cost sensory modality for robot control. However, despite the huge strides in machine learning for computer vision tasks in the last decade, extracting actionable information from images remains challenging. As a result, even simple robotic tasks, such as pushing objects using image observations, commonly require data collected over many hours of robot interaction to synthesize effective visual controllers. If such controllers could transfer reliably and easily to new target robots, then this data collection cost would be amortized. For example, a hospital adding a new robot to its robot fleet could simply plug in their existing controllers and start using it immediately. Going further, other hospitals looking to automate the same tasks could purchase a robot of their choice and download the same controllers.

However, such transferable visuomotor control is difficult to achieve in practice. We observe that the changed visual appearance of the target robot generates out-of-distribution inputs for any learned modules within the controller that operate on visual observations. This issue particularly affects manipulation tasks: manipulation involves operating in intimate proximity with the environment, and any cameras set up to observe the environment cannot avoid also observing the robot.

Figure 1: (Left) As part of our method, robot-aware control (RAC), we propose a modular visual dynamics model that factorizes into a learned world dynamics module and an analytical robot dynamics module. (Right) The unseen robot’s analytical model is used to predict robot dynamics, permitting easy robot transfer.

There is a way out of this bind: most robots are capable of highly precise proprioception and kinesthesis to sense body poses and movements through internal sensors like joint encoders. We propose to use these advanced proprioception abilities and other readily available self-knowledge to develop “robot-aware” visual controllers that can benefit from reliably disentangling pixels corresponding to the robot from those corresponding to the rest of the “world” in image observations.

Robot-aware visual controllers treat the robot and world differently in the control loop to their advantage. While this general principle applies broadly to all visual controllers, we demonstrate the advantages of robot awareness in controllers that work by planning through learned visual dynamics models, commonly called “visual foresight” (Finn and Levine, 2017; Ebert et al., 2018). First, we inject robot awareness into the dynamics model, training it to be invariant to robot appearance, by omitting the robot pixels in its input. This corresponds to a shared robot-invariant “world dynamics” model. We then propose to complete this dynamics model by pairing it with self-knowledge of analytical robot dynamics for each robot. This corresponds to a factorization of the dynamics into world and robot dynamics. Figure 1 shows a schematic. Our experiments show that composing these two modules permits reliably transferring visual dynamics models even across robots that look and move very differently.

Next, we design a robot-aware planning cost over the separated robot and world pixels. We show that this not only allows visual task specifications to transfer from a source to a target robot, it even leads to gains on the source robot itself by allowing the controller to easily reason about the robot and its environment separately.

We evaluate our method, Robot-Aware Control (RAC), on simulated and real-world tabletop object pushing tasks, demonstrating the importance of each robot-aware component. In our most surprising result, we show that a robot-aware visual controller trained for object manipulation on a real-world, low-cost 5-DoF arm can transfer entirely “zero-shot”, with no new training, to a very different 7-DoF Franka Panda arm.

2 Background: The visual foresight paradigm for visuomotor control

Task setup. Our robot-aware controller builds on the visual foresight (VF) paradigm (Finn and Levine, 2017; Ebert et al., 2018). In VF, at each discrete timestep $t$, the input to the controller is an RGB image observation $x_t$ from a static camera observing the workspace of a robot. Since the robot must operate in close proximity with objects in the workspace, these images contain parts of the robot’s body as well as the rest of the workspace. We define a visual projection function $P$ that maps from the true robot state $s_t$ and “world” (i.e., the rest of the workspace) state $w_t$ to the image observation $x_t = P(s_t, w_t)$. Tasks are commonly specified through a goal image $x_g$ that exhibits the target configuration of the workspace. Given the goal image $x_g$, the robot controller must find and execute a sequence of actions $a_1, \ldots, a_T$ to reach that goal, where $T$ is a time limit. In the manipulation settings where VF has mainly been applied, robot states could be joint and end-effector positions, actions could be end-effector position control commands, and world states could be object poses.

Visual foresight algorithm. Visual foresight based visuomotor control involves two key steps:


  • Visual dynamics modeling: The first step is to perform exploratory data collection on the robot to generate a dataset $\mathcal{D}$ of transitions $(x_t, s_t, a_t, x_{t+1}, s_{t+1})$. Dropping time indices, we sometimes use the shorthand $(x, s, a, x', s')$ to avoid clutter. Then, a visual dynamics model $f_\theta$ is trained on this dataset to predict $x_{t+1}$ given $x_t$ and $a_t$ as inputs, i.e. $\hat{x}_{t+1} = f_\theta(x_t, a_t)$. When the robot state $s_t$ is available, it is sometimes included as a third input to $f_\theta$ to assist in dynamics modeling.

  • Visual MPC: Given the trained dynamics model $f_\theta$, VF approaches search over action sequences $a_{1:T}$ to find the sequence whose outcome will be closest to the goal specification $x_g$, as predicted by $f_\theta$. For outcome prediction, they apply $f_\theta$ recursively, as $\hat{x}_{t+1} = f_\theta(\hat{x}_t, a_t)$. Then, they pick the action sequence $a^*_{1:T} = \arg\min_{a_{1:T}} C(\hat{x}_T, x_g)$, where the cost function $C$ is commonly the mean pixel-wise error between the predicted image and the goal image. Sometimes, rather than measure the error of only the final image, the cost function may sum the errors of all intermediate predictions. For closed-loop control, only the first action $a^*_1$ is executed; then, a new image is observed, a new optimal action sequence is computed, and the process repeats. For action sequence optimization, we found the cross-entropy method (CEM) De Boer et al. (2005) to be sufficient, although more sophisticated optimization methods (Zhang et al., 2019; Rybkin et al., 2021) are viable as well.
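As a minimal sketch of this planning loop, assuming a single-step rollout interface — the `dynamics` and `cost` callables and all hyperparameter values here are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def cem_visual_mpc(dynamics, cost, x0, x_goal, horizon=5, action_dim=2,
                   pop=100, elites=10, iters=3):
    """Cross-entropy method over action sequences, as used in visual MPC.
    `dynamics(x, a) -> x_next` is the learned model applied recursively;
    `cost(x_pred, x_goal)` scores the predicted final image."""
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = mu + sigma * np.random.randn(pop, horizon, action_dim)
        scores = []
        for seq in samples:
            x = x0
            for a in seq:  # roll the model forward through the sequence
                x = dynamics(x, a)
            scores.append(cost(x, x_goal))
        elite = samples[np.argsort(scores)[:elites]]
        # Refit the sampling distribution to the elite sequences.
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]  # closed loop: execute only the first planned action
```

In closed-loop use, this function is called once per timestep with the latest observation, matching the replanning scheme described above.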

To train reliable dynamics models $f_\theta$, visual foresight approaches commonly require many hours of exploratory robot interaction (Finn and Levine, 2017; Ebert et al., 2018) even for simple tasks. Data requirements may be reduced somewhat by interleaving model training with data collection: in this case, data is collected by selecting goal images and then running visual MPC with the most recently trained dynamics model to reach those goals. However, such goal selection must be designed to match the target task(s), which may trade off task generality for sample complexity.

Performance metrics. Note that in the above, only images $x_t$, and sometimes the robot state $s_t$, are usually available for goal specification and as inputs to the controller. VF controllers aim to match the final image $x_T$ to the goal image $x_g$, as described above. However, performance metrics are usually best specified as functions of the final world state $w_T$ and the goal world state $w_g$. For example, the goal image might show a stack of five blocks on a table, in which case success might be measured in terms of the number of blocks successfully stacked by the controller.

3 Transferable robot-aware visuomotor control

Having trained one data-hungry visual dynamics model on one robot, do we still need to repeat this process from scratch on a new robot aiming to perform the same tasks? In standard VF, the answer is unfortunately yes, for two main reasons: (1) With the new robot, all observations are out-of-domain for $f_\theta$: it has never seen the new robot. Further, the visual dynamics of the new robot may be very different, as between a green 3-DOF robot arm and a red 5-DOF robot arm. (2) Standard VF approaches commonly require the task specification image $x_g$ itself to contain the robot. This makes it impossible to plan to reach $x_g$ using a different robot. We show how these obstacles may be removed through “robot-awareness” to produce transferable VF controllers.

Robot self-knowledge assumptions. We assume a robot whose state $s_t$, i.e., the configuration of its body, is easily and accurately available through internal proprioceptive sensors. Further, the embodiment of the robot is fully known, including: (1) its morphology, a full geometric specification of the robot’s embodiment including all components, how they are linked, and degrees of freedom, (2) its dynamics $f^r$, the way the robot moves in response to high-level robot motion commands, and (3) camera parameters, including extrinsic robot-camera calibration and camera intrinsics.

Grounding this discussion in the context of the experiments in this paper, it is easy to see that static rigid-linked robot arms satisfy all three requirements. Joint encoders typically provide very precise proprioception, URDF files are readily available from manufacturers, as are end-effector Cartesian position controllers that can execute commands like “move the gripper 10 cm forward to the left.” Finally, camera calibration is easy to perform. Beyond static manipulation, these assumptions are also broadly true for many other settings such as wheeled mobile manipulation, and UAV flight.

Becoming aware of the robot in the scene. Where in the visual scene is the robot? At any given time, the robot’s current state $s_t$ and its morphology fully specify its 3D shape in robot-centric coordinates. It is then trivial to project this shape through the camera parameters to obtain the visual projection of the robot in the scene. We represent this as an image segmentation mask $M_t$ of the same height and width as image observations $x_t$: $M_t$ is 1 on robot pixels and 0 elsewhere. We show examples of such masks in Fig 2, and explain how we implement this projection in the appendix. This simple process effectively spatially disentangles robot pixels from world pixels in image observations $x_t$. These robot masks are a key building block for robot-aware VF, since they permit controllers to treat robot and world pixels differently in both components of VF, as explained below. (We discuss in the appendix how these masks could be improved to handle occlusions by using RGB-D observations.)
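A point-based sketch of this projection, assuming points sampled from the robot's mesh at its current joint configuration (e.g., from the URDF via forward kinematics), a known extrinsic transform, and pinhole intrinsics — the function and argument names here are ours, for illustration:

```python
import numpy as np

def robot_mask(points_robot, T_cam_robot, K, height, width):
    """Render an approximate robot segmentation mask M_t.
    points_robot: (N, 3) points sampled from the robot's geometry at its
    current configuration; T_cam_robot: 4x4 extrinsic transform from the
    robot frame to the camera frame; K: 3x3 camera intrinsics."""
    N = points_robot.shape[0]
    homo = np.hstack([points_robot, np.ones((N, 1))])       # (N, 4)
    cam = (T_cam_robot @ homo.T).T[:, :3]                   # to camera frame
    cam = cam[cam[:, 2] > 0]                                # keep points in front
    pix = (K @ cam.T).T
    uv = (pix[:, :2] / pix[:, 2:3]).astype(int)             # perspective divide
    mask = np.zeros((height, width), dtype=np.uint8)
    ok = ((uv[:, 0] >= 0) & (uv[:, 0] < width) &
          (uv[:, 1] >= 0) & (uv[:, 1] < height))
    mask[uv[ok, 1], uv[ok, 0]] = 1                          # 1 on robot pixels
    return mask
```

A dense mesh rasterization (or a simple dilation of this point mask) would close the holes left by sparse sampling.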

3.1 Robot-aware modular visual dynamics

Figure 2: The visual dynamics architecture is composed of an analytical robot model and a learned world model.

Recall that the robot’s dynamics are known to us as a function $s_{t+1} = f^r(s_t, a_t)$, where $t+1$ denotes the next timestep. How might we incorporate this knowledge into our visual dynamics model? In this section, we propose an approach to effectively factorize the visual dynamics into world and robot dynamics terms to facilitate this.

Given the robot state $s_t$ and the image observation $x_t$, we first compute the projected robot mask $M_t$ as above, then mask out the robot in the image to obtain the world-only observation $\bar{x}_t = (1 - M_t) \odot x_t$, where $\odot$ is the pixel-wise product. See Fig 2 for examples of such masked images. Now, we train a world-only dynamics model $f^w_\theta$ by minimizing the following error, summed over all transitions in the training dataset $\mathcal{D}$:

$$\mathcal{L}(\theta) = \sum_{(x, s, a, x') \in \mathcal{D}} \left\| f^w_\theta(\bar{x}, s, s') - \bar{x}' \right\|^2, \tag{1}$$

where the next robot state $s'$ is computed using the known robot dynamics, as $s' = f^r(s, a)$. Treating all dynamics models as probabilistic, this is equivalent to the following decomposition of the full visual dynamics into a robot dynamics module and a world dynamics module:

$$p(\bar{x}', s' \mid \bar{x}, s, a) = \underbrace{p(s' \mid s, a)}_{\text{robot}} \; \underbrace{p(\bar{x}' \mid \bar{x}, s, s')}_{\text{world}}. \tag{2}$$
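The factorized rollout implied by this decomposition can be sketched as follows; the interfaces of `f_world`, `f_robot`, and `mask_fn` are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def masked_world_input(image, mask):
    """Remove robot pixels: x_bar = (1 - M) * x, applied per channel."""
    return image * (1 - mask)[..., None]

def rollout_factorized(f_world, f_robot, image, mask_fn, s, actions):
    """Predict future world-only frames by composing analytical robot
    dynamics with the learned world model. `f_robot(s, a) -> s_next` is
    given analytically; `mask_fn(s)` renders the robot mask for a state;
    `f_world(x_bar, s, s_next) -> x_bar_next` is the learned module."""
    x_bar = masked_world_input(image, mask_fn(s))
    preds = []
    for a in actions:
        s_next = f_robot(s, a)   # no learning needed for the robot term
        x_bar = f_world(x_bar, s, s_next)
        preds.append(x_bar)
        s = s_next
    return preds
```

Swapping in a new robot only changes `f_robot` and `mask_fn`; the learned `f_world` is reused unchanged, which is the source of the zero-shot transfer described next.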
What advantage does this modularity offer? In this modularized dynamics model, the robot dynamics $f^r$ is available for every robot “out-of-the-box”, requiring no data collection or training. Thus, the only learned component of the dynamics is the world dynamics module $f^w_\theta$.

Since $f^w_\theta$ largely captures the physics of objects in the workspace, we hypothesize that it can be shared across very different robots. For example, two very different robot arms pushing or grasping the same object will commonly have similar effects on that object.

Specifically, suppose that a robot-aware world-only dynamics module $f^w_\theta$ has been trained on a robot arm with robot dynamics $f^r_1$ for various manipulation tasks, as seen in Figure 1. The full dynamics model, used during visual MPC, would factorize as in Eq (2). Then, given a new robot arm with dynamics $f^r_2$, its full dynamics model is available without any new data collection at all. We validate this in our experiments, demonstrating few-shot and zero-shot transfer of $f^w_\theta$ between very different robots.

3.2 Robot-aware planning costs

Recall from Sec 2 that visual MPC relies on the planning cost function $C(\hat{x}, x_g)$, which measures the distance between a predicted future observation $\hat{x}$ and the goal $x_g$. It is clear that for any task, the ideal planning cost is best specified as some function of the robot configurations and the world configurations. However, since only image observations are available, it is common in VF to aim to minimize a pixel-wise error planning cost (Tian et al., 2019; Nair and Finn, 2020; Jayaraman et al., 2019), such as $C(\hat{x}, x_g) = \|\hat{x} - x_g\|^2$, that cannot distinguish between robot regions and world regions. Our disentangled observations make it easy to decompose the cost into a robot-specific cost and a world-specific cost, $C = \alpha\, C_{\text{robot}} + C_{\text{world}}$, where $C_{\text{world}}$ is computed only over non-robot pixels and $\alpha$ weights the robot term.

Figure 3: The behavior of the pixel and RA costs between the current images of a feasible trajectory and the goal image. (Left) The first row shows the trajectory. The next two rows show the heatmaps (pixel-wise error) between the image and goal for each cost. The pixel cost heatmaps show high cost values in the robot region, while the RA cost heatmaps correctly have zero cost in the robot region. (Right) The relative pixel cost fails to decrease as the trajectory progresses, while the RA cost correctly decreases.

Why should this decomposed cost help? This decomposition makes it possible to separately modulate the extent to which robot and world configurations affect the planning cost. The basic pixel-wise cost suffers from a key problem: it is affected inordinately by the spatial extents of objects in the scene, so that large objects are weighted more heavily than small objects. Indeed, Ebert et al. (2018) report that the robot arm itself frequently dominates the pixel cost in manipulation. This means that the planner often selects actions that match the robot position in a goal image, ignoring the target objects. See Figure 3 for a visual example of the pixel cost behavior and its failure to compute meaningful costs for planning. As a result, even if the task involves displacing an object, the goal image is usually required to contain the robot in a plausible pose while completing the task (Nair and Finn, 2020).

For our goal of robot transfer, this is an important obstacle, since the task specification is itself robot-specific. Even ignoring transfer, robot-dominant planning costs hurt performance, and gathering goal images with the robot is cumbersome. Our approach does not require robots to be in plausible task completion positions in the goal images. In fact, they may even be completely absent from the scene. In our experiments, we use goal images without robots, and sometimes with humans in place of robots. To handle them, we set the weight on the robot term to 0 in the decomposed cost, instantiating a cost function that ignores the robot region. We refer to this as the robot-aware (RA) cost.
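A sketch of such a robot-aware cost; the exact weighting and per-region normalization here are our illustrative choices, not necessarily the paper's:

```python
import numpy as np

def robot_aware_cost(pred_image, goal_image, pred_robot_mask, goal_robot_mask,
                     robot_weight=0.0):
    """Decomposed planning cost: a world term over non-robot pixels plus a
    robot term over robot pixels, with the robot term down-weighted
    (weight 0 here, matching goal images that contain no robot)."""
    # Pixels that are robot-free in both the prediction and the goal.
    world = (1 - pred_robot_mask) * (1 - goal_robot_mask)
    world = world[..., None]                      # broadcast over channels
    diff = (pred_image - goal_image) ** 2
    world_cost = (world * diff).sum() / max(world.sum(), 1)
    robot_cost = ((1 - world) * diff).sum() / max((1 - world).sum(), 1)
    return robot_weight * robot_cost + world_cost
```

With `robot_weight=0`, a prediction that matches the goal everywhere outside the robot region incurs zero cost, regardless of where the arm is.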

Some VF-based methods do handle goal images without robots; however, they typically require all input images for dynamics and planning to also omit the robot, so that image costs are exclusively influenced by the world configurations Wang et al. (2019); Pathak et al. (2018); Agrawal et al. (2016). For closed-loop controllers, this often means extremely slow execution times, because, at every timestep, the robot must enter the scene, execute an action, and then move out of the camera view. This also eliminates tasks requiring dynamic motions.

Finally, there are efforts to learn more sophisticated cost functions over input images Nair et al. (2020); Sermanet et al. (2018); Srinivas et al. (2018); Yu et al. (2019); Tian et al. (2020). For example, Nair et al. (2020) train a latent representation to focus on portions of the image that are different between the goal and the current image, and show that costs computed over these latents permit better control on one robot. These approaches all learn the cost contributions of different objects or regions from data. Instead, we directly disentangle the robot using readily available information. While we restrict our evaluation to basic pixel-based costs, we expect that these other costs would also benefit from disentangled inputs, i.e., from being computed over robot-masked images.

3.3 Implementation details

For implementing the learned world dynamics model $f^w_\theta$, we use the SVG architecture (Denton and Fergus, 2018). As input to the world model, the robot state is represented as the end-effector pose and the mask of the robot. The current and future masks are concatenated with the image, and the end-effector pose is tiled into the latent spatial embedding. To facilitate transfer to new robots, we project a black “proxy” robot into the image at all times to induce robot color invariance. See appendix A.1 for details.
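The input assembly described above might look like the following sketch; shapes and names are illustrative, not the exact SVG-based architecture:

```python
import numpy as np

def assemble_world_model_input(image, mask_t, mask_t1, ee_pose, feat_hw=(8, 8)):
    """Concatenate the current and next-step robot masks with the image
    channels, and tile the end-effector pose over a spatial feature grid
    (ready to be fused with the model's latent embedding)."""
    x = np.concatenate([image, mask_t[..., None], mask_t1[..., None]], axis=-1)
    h, w = feat_hw
    pose_grid = np.tile(ee_pose[None, None, :], (h, w, 1))  # (h, w, pose_dim)
    return x, pose_grid
```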

4 Experiments

(a) WidowX200
(b) Modified WidowX200
(c) Franka
Figure 4: Pictures of the WidowX200, modified WidowX200, and Franka robot.

Our experiments focus on zero and few-shot transfer across robots for standard object pushing tasks Finn and Levine (2017); Ebert et al. (2018); Dasari et al. (2019): a robot arm must push objects to target configurations, specified by a goal image, on a cluttered tabletop. We aim to answer: (1) How does the robot-aware controller compare against a standard controller when transferred to a new robot? (2) Which robot-aware components in the visual foresight pipeline are most important for transfer?

4.1 Transferring visual dynamics models across robots

Model       Robot State   Analytical Dynamics   Proxy Robot   World
VF          -             -                     -             -
VFState     ✓             -                     -             -
RA (ours)   ✓             ✓                     ✓             ✓
Table 1: Feature comparison between baselines and the robot-aware dynamics model.

Setup. First, we study how well the world dynamics module transfers across robots. Table 1 presents the baselines. Prior visual foresight methods Ebert et al. (2018); Finn and Levine (2017) have used action-conditioned models that solely rely on image input (VF), as well as models that rely on robot state and images (VFState). For all prediction experiments, we predict 5 steps into the future and evaluate the world region PSNR and SSIM averaged over 5 timesteps. We perform experiments in simulation as well as on three separate real-world settings. Additional details about calibration, data, training, and evaluation are in the appendix section A.4.

Simulated zero-shot transfer from one robot. Simulation permits a “straight swap” of the robot with perfectly controlled viewpoints and environments. We train on 10k trajectories of length 30, involving a 5-DOF WidowX200 arm performing random actions on a tabletop with several objects. We evaluate these models zero-shot on 1000 trajectories of a new robot, a 7-DOF Fetch arm.

Zero-shot transfer from one robot. First, we evaluate zero-shot transfer to an unseen Franka robot (Fig. 4c) from models trained on 1.8k videos of a single WidowX200 robot (Fig. 4a) randomly perturbing the objects. For this setting, to avoid domain differences in the world region observations, we test the Franka robot in the same workspace.

Zero-shot transfer from multiple robots. Next, we evaluate zero-shot transfer to the Franka robot, as well as to an unseen Modified WidowX200 (Fig. 4b) from models trained on a multi-robot dataset. To modify the appearance and dynamics of the WidowX200, we swapped out the black 14cm forearm link with a silver 20cm forearm link, and added a foam bumper and sticker.

The models are pretrained on 83k videos, including 82k RoboNet Dasari et al. (2019) videos of Sawyer, Baxter, and WidowX robots, and 1k videos from the WidowX200 dataset described above (note that the WidowX and WidowX200 are different robots). We used a subset of RoboNet for which we were able to manually annotate camera calibration matrices and robot CAD models. See appendix section A.6 for details.

Few-shot transfer to WidowX200 from RoboNet. Finally, we evaluate few-shot transfer to an unseen WidowX200 by first pretraining the models on RoboNet data, and then finetuning on 400 videos of the WidowX200. This is consistent with experiments reported in RoboNet Dasari et al. (2019).

Model       Sim. Fetch             (Train on       Mod. WidowX200        (Train on       (Train on
            (Train on WidowX200)   3 robots)       (Train on 4 robots)   4 robots)       WidowX200)
            PSNR   SSIM            PSNR   SSIM     PSNR   SSIM           PSNR   SSIM     PSNR   SSIM
VF          39.9   0.985           32.86  0.939    31.40  0.924          28.31  0.901    28.42  0.905
VFState     -      -               33.05  0.941    31.43  0.924          29.19  0.901    28.31  0.894
RA (Ours)   41.8   0.990           33.26  0.944    32.12  0.929          29.37  0.904    29.63  0.928
Table 2: World dynamics model evaluations, reported as average world region PSNR and SSIM over five timesteps. We compare against conventional models that use image and/or state.
Figure 5: Model outputs of the zero-shot experiments. First row: the RA model is able to predict contact with the green box, while the baseline (VFState) predicts none. Second row: the RA model predicts the downward motion of the octopus more accurately. Refer to the website for videos.


Table 2 shows all quantitative prediction results for average PSNR and SSIM per timestep over a 5-step horizon. The robot-aware world dynamics module outperforms the baselines across all zero-shot and few-shot transfer settings, simulated and real, on both metrics. To supplement the quantitative metrics, Fig 5 shows illustrative examples of 5-step prediction, where we recursively run visual dynamics models to predict images 5 steps out from an input.

We show more examples of prediction results in the appendix and website, and synthesize some salient observations here. First, the baseline VFState model overlays the training robot onto the target robot rather than moving the test-time robot (see Figure 5, bottom right). Next, RA produces relatively sharper object images and more accurate object motions, as with the pink octopus toy in Figure 5. Finally, while both VFState and RA predict the motion of large objects better than that of small objects, VFState more frequently predicts too-small or even zero object motion, and often fails to predict any motion at all for smaller objects (see Figure 5, top right). As we will show in Sec. 4.2, these performance differences in object dynamics prediction are crucial for successfully executing manipulation tasks.

Figure 6: Example results from the control experiments (full videos on website). We show the start, goal, and end states of RA/RA and baseline (VFState/Pixel). The goal image is overlaid on top of the end state for visual reference.

4.2 Transferring visual controllers across robots

Dynamics Model   Cost    Multi-Robot Data   Success
VFState          Pixel   -                  0/30 (0%)
RA               Pixel   -                  0/30 (0%)
VFState          RA      -                  6/30 (20%)
RA               RA      -                  22/30 (71%)
VFState          Pixel   ✓                  11/30 (36%)
RA               RA      ✓                  27/30 (90%)
Table 3: Object pushing success rates for zero-shot transfer to an unseen Franka robot.

Next, we evaluate our full robot-aware pipeline on object pushing tasks in two settings: WidowX200 few-shot transfer from multiple robots, and Franka zero-shot transfer from a single WidowX200 robot. Aside from serving as further evaluation of the RA dynamics models, this also permits us to evaluate the RA planning cost. In appendix section A.2, we evaluate the RA cost in isolation by planning with perfect dynamics models.

Following the visual foresight setup, we evaluate all methods on their ability to move objects to match a goal image. We experiment with combinations of predictive models and cost functions to probe their transferability to new robots. Going forward we refer to all controllers by the names of the dynamics model and the cost, e.g., RA/RA for the full robot-aware controller.

After each evaluation episode, we measure the distance between the pushed object’s current and goal centroids. If it is less than 5cm, the task is successful. A single push task is defined by a goal image. We vary the push direction, length, and object for generating the goal images. The pushes range from 10-15cm in length. For each task, we use an object-only goal image where the object is in the correct configuration, but the robot is not in the image. See the appendix for more details.

Tables 3 and 4 show that RA/RA achieves the highest success rate. We observed two primary failure modes of the baseline and ablations in zero-shot scenarios. First, both VFState/Pixel and RA/Pixel, which optimize the pixel cost, tend to retract the arm back into its base, because the cost penalizes the arm for contrasting against the background of the goal image, as seen in Figure 6. Next, consistent with our observations in Section 4.1, VFState/Pixel and VFState/RA suffer from predicting blurry images of the training robot and inaccurate object dynamics, which negatively impacted planning.

The failure of the baseline to zero-shot transfer is expected, since generalizing to a new robot from training on a single robot dataset is a daunting task. One natural improvement is to train on multiple robots before transfer to facilitate generalization. We evaluate RA/RA and VFState/Pixel in this multi-robot pretraining setting, where they are trained on RoboNet videos in addition to our WidowX200 dataset. As Table 3 shows, the baseline controller improves with training on additional robots, but still falls far short of RA/RA trained on a single robot. RA/RA improves still further with multi-robot pretraining.

Dynamics Model Cost Success
VF Pixel 4/50 (8%)
RA RA 40/50 (80%)
Table 4: Few-shot transfer to WidowX200 control results.

Few-shot transfer. Next, we evaluate the controllers in the few-shot setting, where the models are pretrained on RoboNet and finetuned on 400 videos of the WidowX200. As seen in Table 4, RA/RA significantly outperforms the baseline controller. Because the baseline model fails to model object dynamics for all but the largest object, as mentioned in Section 4.1, the baseline controller succeeds only on pushes with the largest object. The RA model adequately models object dynamics for all object types, and succeeds at least once for each object.

Figure 7: Example human goal image.

Imitating human goal images. Finally, we evaluate the robot-aware cost’s effectiveness at achieving goal images containing a human arm. Videos are on the project website. We collect five goal images by recording human pushing demonstrations and use the last image from each video as the goal image, as seen in Figure 7. Human masks are annotated using Label Studio Tkachenko et al. (2020). We run the RA/RA and RA/Pixel controllers, which differ only in the planning cost. RA/RA achieves all five goals, whereas RA/Pixel fails on all of them.

5 Prior work on robot transfer outside the visual foresight paradigm

Several works have focused on transferring controllers between tasks (Duan et al., 2017; Finn et al., 2017b), environments (Sermanet et al., 2018; Finn et al., 2017a), and from simulation to reality (Tobin et al., 2017), but relatively little attention has been paid to transfer across distinct robots. Aside from RoboNet Dasari et al. (2019), discussed above, a few prior works have studied providing the robot morphology as input to an RL policy, enabling transfer to new robots (Chen et al., 2018; Devin et al., 2017; Wang et al., 2018). Devin et al. (2017) train modular policies for simulated robots containing a robot module and a task module, both learned from data. To zero-shot transfer a task controller to a new robot, they require the learned robot module for the new robot, trained on other related tasks. Chen et al. (2018) propose to train a “universal” policy conditioned on a vector representation of the robot hardware, alleviating the need to train the robot module during transfer to a new robot. In contrast, we factorize task-agnostic dynamics models rather than task-specific policies, leverage readily available analytical robot models that permit plug-and-play transfer to completely unseen robots, and evaluate on real robots.

Robotic self-recognition literature assumes limited or no self-knowledge, and aims to identify the robot in visual observations Michel et al. (2004); Natale et al. (2007); Edsinger and Kemp (2006). Recently, Yang et al. (2020) propose an approach for fast, coarse-grained visual self-recognition and control of the robot’s end-effector. We assume knowledge of robot morphology, dynamics, and camera calibration, which are commonly available or easy to acquire in many settings. We then focus on learning zero-shot-transferable world-only visual dynamics models for object manipulation tasks.

6 Limitations

We rely on robot self-knowledge of its morphology, proprioception, camera calibration, and dynamics, for zero-shot transfer of visuomotor controllers across robots. However, this knowledge may be imperfect. Noisy joint encoders, inaccurate calibration, low-level controller failure resulting in wrong dynamics, or damage to the robot may all affect the performance of robot-aware controllers. However, it may be possible in these cases to learn or partially update this self-knowledge in response to observed deviations, such as by learning a residual robot dynamics model for each robot. Another limitation is that the world dynamics models in Eq (2), while they transfer well in our object pushing tasks, might not as easily transfer for more sophisticated contact-rich tasks such as in-hand object manipulation. In such cases, this term may depend more heavily on the contact dynamics of each individual robot’s end-effectors. This concern may be addressed by sharing end-effectors across robots, or by introducing an additional learned term into the dynamics factorization, to capture contact dynamics.

7 Conclusion

In conclusion, we have demonstrated that simple modifications to exploit readily available robot self-knowledge can permit widely used visual foresight-based controllers to transfer across very different robots. In future work, this knowledge may also be integrated into other visual control settings including visual foresight with more sophisticated image-based cost functions, as well as other approaches to visual control, such as model-free reinforcement learning.

This work was partly supported by an Amazon Research Award to DJ. The authors would like to thank Karl Schmeckpeper for help with RoboNet, Leon Kim for support with the Franka, and the Perception, Action, and Learning (PAL) research group at UPenn for constructive feedback.


  • P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine (2016) Learning to poke by poking: experiential learning of intuitive physics. arXiv preprint arXiv:1606.07419. Cited by: §3.2.
  • T. Chen, A. Murali, and A. Gupta (2018) Hardware conditioned policies for multi-robot transfer learning. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. External Links: Link Cited by: §5.
  • S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn (2019) RoboNet: large-scale multi-robot learning. In 3rd Annual Conference on Robot Learning, CoRL 2019, Osaka, Japan, October 30 - November 1, 2019, Proceedings, L. P. Kaelbling, D. Kragic, and K. Sugiura (Eds.), Proceedings of Machine Learning Research, Vol. 100, pp. 885–897. External Links: Link Cited by: §A.4, §4.1, §4.1, §4, §5.
  • P. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein (2005) A tutorial on the cross-entropy method. Annals of operations research 134 (1), pp. 19–67. Cited by: 2nd item.
  • E. Denton and R. Fergus (2018) Stochastic video generation with a learned prior. In International Conference on Machine Learning, pp. 1174–1183. Cited by: §A.1, §3.3.
  • C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine (2017) Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE International Conference on Robotics and Automation, ICRA 2017, Singapore, May 29 - June 3, 2017, pp. 2169–2176. External Links: Link, Document Cited by: §5.
  • Y. Duan, M. Andrychowicz, B. Stadie, J. Ho, J. Schneider, I. Sutskever, P. Abbeel, and W. Zaremba (2017) One-shot imitation learning. In Advances in Neural Information Processing Systems, pp. 1087–1098. Cited by: §5.
  • F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568. Cited by: §1, §2, §2, §3.2, §4.1, §4.
  • A. Edsinger and C. C. Kemp (2006) What can I control? A framework for robot self-discovery. In 6th International Conference on Epigenetic Robotics, Cited by: §5.
  • C. Finn, P. Abbeel, and S. Levine (2017a) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: §5.
  • C. Finn and S. Levine (2017) Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), Vol. , pp. 2786–2793. External Links: Document Cited by: §1, §2, §2, §4.1, §4.
  • C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine (2017b) One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, pp. 357–368. Cited by: §5.
  • D. Jayaraman, F. Ebert, A. A. Efros, and S. Levine (2019) Time-agnostic prediction: predicting predictable video frames. ICLR. Cited by: §3.2.
  • P. Michel, K. Gold, and B. Scassellati (2004) Motion-based robotic self-recognition. In IROS, Cited by: §5.
  • S. Nair and C. Finn (2020) Hierarchical foresight: self-supervised learning of long-horizon tasks via visual subgoal generation. In International Conference on Learning Representations, External Links: Link Cited by: §3.2, §3.2.
  • S. Nair, S. Savarese, and C. Finn (2020) Goal-aware prediction: learning to model what matters. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 7207–7219. External Links: Link Cited by: §3.2.
  • L. Natale, F. Orabona, G. Metta, and G. Sandini (2007) Sensorimotor coordination in a “baby” robot: learning about objects through grasping. Progress in brain research 164, pp. 403–424. Cited by: §5.
  • D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell (2018) Zero-shot visual imitation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 2050–2053. Cited by: §3.2.
  • O. Rybkin, C. Zhu, A. Nagabandi, K. Daniilidis, I. Mordatch, and S. Levine (2021) Model-based reinforcement learning via latent-space collocation. International Conference on Machine Learning (ICML). Cited by: 2nd item.
  • P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine (2018) Time-contrastive networks: self-supervised learning from video. In IEEE International Conference on Robotics and Automation, pp. 1134–1141. Cited by: §3.2, §5.
  • A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn (2018) Universal planning networks. arXiv preprint arXiv:1804.00645. Cited by: §3.2.
  • S. Tian, F. Ebert, D. Jayaraman, M. Mudigonda, C. Finn, R. Calandra, and S. Levine (2019) Manipulation by feel: touch-based control with deep predictive models. In 2019 International Conference on Robotics and Automation (ICRA), Vol. , pp. 818–824. External Links: Document Cited by: §3.2.
  • S. Tian, S. Nair, F. Ebert, S. Dasari, B. Eysenbach, C. Finn, and S. Levine (2020) Model-based visual planning with self-supervised functional distances. External Links: arXiv:2012.15373 Cited by: §3.2.
  • M. Tkachenko, M. Malyuk, N. Shevchenko, A. Holmanyuk, and N. Liubimov (2020) Label Studio: data labeling software. External Links: Link Cited by: §4.2.
  • J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 23–30. Cited by: §5.
  • E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §A.3.
  • A. Wang, T. Kurutach, K. Liu, P. Abbeel, and A. Tamar (2019) Learning robotic manipulation through visual planning and acting. arXiv preprint arXiv:1905.04411. Cited by: §3.2.
  • T. Wang, R. Liao, J. Ba, and S. Fidler (2018) NerveNet: learning structured policy with graph neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • B. Yang, D. Jayaraman, G. Berseth, A. Efros, and S. Levine (2020) MAVRIC: morphology-agnostic visual robotic control. ICRA and RA-L. Cited by: §5.
  • T. Yu, G. Shevchuk, D. Sadigh, and C. Finn (2019) Unsupervised visuomotor control through distributional planning networks. In Proceedings of Robotics: Science and Systems (RSS), Cited by: §3.2.
  • M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. Johnson, and S. Levine (2019) SOLAR: deep structured representations for model-based reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 7444–7453. Cited by: 2nd item.

Appendix A

We encourage the reader to visit the project website for an explanation video and additional video visualizations.

A.1 Network details

The robot-aware dynamics model described in Section 3.1 consists of two modules: an analytical, robot-specific robot dynamics model and a learned, visual world dynamics model. The robot dynamics model predicts the next robot state (end-effector pose and mask) from the current robot state and action, and feeds it as input into the world model. This future robot state and mask helps the world model predict the overall future state, as it can infer object displacement from such information.

Neural network architecture for the learned visual dynamics model. We extend the stochastic video generation (SVG) architecture [Denton and Fergus, 2018], which consists of a convolutional encoder, a frame-predictor LSTM, and a decoder, alongside a learned prior and posterior.

The encoder consists of 4 layers of VGG blocks (convolution, batch norm, leaky ReLU) and passes skip connections to the decoder. We use a convolutional LSTM for the learned prior, posterior, and frame predictor. As seen in Figure 2, visual data such as the current RGB observation with dimension (3, 48, 64), the current mask with dimension (1, 48, 64), and the future mask (1, 48, 64) are concatenated channel-wise and fed into the encoder, which convolves the input into a feature map of dimension (256, 8, 8). Additional data such as the action, current state, and future state are tiled onto the feature map before being passed into the recurrent frame predictor, which outputs a feature map of size (256, 8, 8) that gets decoded into the next RGB image.

Generating predictions from the dynamics model at test time. During training, the RA-model uses the ground-truth future robot state as input for prediction, since all data is recorded in advance. From our prediction results in Table 2, we find that using future robot states and masks is useful for prediction. But how might we produce this future input at test time?

We use the analytical dynamics model described in Section 3.1 to compute the future inputs. We now describe this process in more detail. The analytical dynamics model predicts the future robot state as a function of the current state and action. For example, if the control commands are robot end-effector displacements, then the robot state is the gripper pose, and the next state is the current pose shifted by the commanded displacement. Then, to obtain the masks, we compute the full robot pose and its visual projection in two simple steps. First, we compute the joint positions using inverse kinematics, and then project the joint positions into a mask. We use the analytical inverse kinematics solver from the PyRobot library to obtain future joint positions for the WidowX200, while for the Franka we use the MoveIt! motion planning ROS package.
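Under displacement control as described above, the analytical robot dynamics reduces to shifting the current pose by the commanded displacement. A minimal numpy sketch (the function name and state layout are illustrative assumptions, not the paper's code):

```python
import numpy as np

def analytical_robot_dynamics(state, action):
    """Predict the next end-effector pose under displacement control.

    state:  (5,) array, e.g. [x, y, z, yaw, gripper] (layout assumed)
    action: (5,) array of commanded displacements
    """
    # With displacement control, the next pose is simply the
    # current pose shifted by the commanded displacement.
    return state + action
```

The predicted pose would then be passed to an inverse kinematics solver and projected into a mask, as described above.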

Next, we use a simulator such as MuJoCo, which supports 3D rendering of the robot, along with the camera calibration to project the virtual robot into a 2D mask. These steps are also important for recursively applying the dynamics models to produce “rollouts”. Given a starting image, starting state, and a sequence of actions, we can predict the next image as above. To predict subsequent images, the world model needs the future robot states and masks as input, all of which are inferred as above through the analytical dynamics. Algorithm 1 provides pseudocode to generate such rollouts.

1:  Input: world model, analytical robot model, start image I_0, start mask M_0, start state s_0, actions a_0, …, a_{T-1}.
2:  for t = 0 to T-1 do
3:     Compute the next robot state s_{t+1} from (s_t, a_t) with the analytical robot model.
4:     Compute the next mask M_{t+1} from s_{t+1} via inverse kinematics and projection.
5:     Predict the next image I_{t+1} with the world model from (I_t, M_t, M_{t+1}, s_t, s_{t+1}, a_t).
6:  end for
7:  return predicted images I_1, …, I_T


Algorithm 1 Generating rollouts with the robot-aware dynamics model.
def ra_l1_loss(pred, target, mask):
    # pred, target: (3, H, W) float tensors; mask: (1, H, W) boolean tensor
    difference = target - pred  # (3, H, W)
    # repeat mask across the channel axis so we can use it
    # to index the robot region in difference
    robot_region = mask.repeat(3, 1, 1)
    # ignore the pixel differences in the robot region
    difference[robot_region] = 0
    # compute the L1 norm averaged over the world pixels
    num_world_pixels = (~robot_region).sum()
    l1_per_pixel = difference.abs().sum() / num_world_pixels
    return l1_per_pixel


Algorithm 2 PyTorch code for the robot-aware loss.

RA-loss implementation. In Algorithm 2, we provide a PyTorch implementation of the robot-aware dynamics loss (Eq. 1) over RGB images. Note that the principle of separating the robot region from the world region in the loss computation is general, and can be applied to other state formats as well.
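For illustration, the same masking logic can be expressed with plain numpy arrays; this is a hedged re-implementation for readers without a PyTorch setup, not the authors' code:

```python
import numpy as np

def ra_l1_loss_np(pred, target, mask):
    """L1 loss averaged over world pixels only (numpy illustration).

    pred, target: (3, H, W) float arrays
    mask:         (1, H, W) boolean array, True on robot pixels
    """
    difference = target - pred
    # Broadcast the robot mask across the 3 color channels.
    robot_region = np.repeat(mask, 3, axis=0)
    # Zero out pixel differences inside the robot region.
    difference[robot_region] = 0.0
    # Average the absolute error over world pixels only.
    num_world_pixels = (~robot_region).sum()
    return np.abs(difference).sum() / num_world_pixels
```

Dividing by the number of world pixels (rather than all pixels) keeps the loss scale comparable across robots with different silhouette sizes.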

Proxy-robot. An image is composed of a robot image and a world image, delineated by the mask of the robot. We propose to edit all images by replacing the pixels in the robot region with a proxy robot, so that the resulting images are approximately invariant to the appearance of the original robot before any further processing. In practice, we found it sufficient for our “proxy robot” to simply be black over all the robot pixels, though other choices are possible. Through this simple procedure, the input becomes largely invariant to the appearance (such as color, lighting, and texture) of the robot at test time.
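The proxy-robot substitution amounts to overwriting the robot pixels before any image enters the model. A sketch under the paper's "black proxy" choice; the array shapes and function name are assumptions:

```python
import numpy as np

def apply_proxy_robot(image, mask, proxy_color=0.0):
    """Replace robot pixels with a proxy appearance (black by default).

    image: (3, H, W) float array
    mask:  (1, H, W) boolean array, True on robot pixels
    """
    edited = image.copy()
    # Overwrite every robot pixel in all three channels, making the
    # result invariant to the original robot's appearance.
    edited[np.repeat(mask, 3, axis=0)] = proxy_color
    return edited
```

Copying before editing keeps the original observation intact for logging or visualization.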

A.2 Evaluating the robot-aware cost in isolation

Figure 8: We evaluate the performance of the conventional cost function V-C and the robot-aware cost function RA-C by running 100 push tasks in which the robot must achieve an object-only goal image. For each task, we then visualize the final pose distance between the object and its pose in the goal image. The dashed horizontal line indicates the success threshold of 1cm. RA-C succeeds in most tasks by outputting a sensible cost over the world region, while V-C's cost is distorted by the robot.

The control experiments performed in Section 4.2 evaluate the robot-aware cost with various choices of dynamics models. Here, we evaluate the cost function more closely in isolation, by using the ground-truth dynamics of the simulator instead of a learned model.

In Section 3.2, we propose to separate the conventional pixel-wise cost into a robot-specific and a world-specific cost. We report the results of a simulated experiment in which we set up a block pushing environment with the Fetch robot. The environment consists of three objects with varying shapes, colors, and physics. We sample the goal image by moving one of the objects from its initial pose to a random pose 10cm away.

We then run the visual foresight pipeline using ground-truth dynamics and the given cost function for 5 action steps, and record the final distance between the object and its pose in the target image. Success is defined as moving the object within 1cm of the goal pose. As seen in Figure 8, RA-C achieves a 95% success rate, whereas V-C achieves only 16%.

RA-C is able to disregard the extraneous robot when computing the pixel difference between the current image (with the robot) and the goal image (without the robot). V-C, on the other hand, leads the CEM to select actions that move the robot out of the scene rather than moving the object to the correct pose.

A.3 Additional simulated WidowX experiment

In addition to the real-world transfer examples reported in the main paper, we now experiment with robot transfer in MuJoCo simulation Todorov et al. [2012] by setting up a tabletop manipulation workspace. During training, the model is trained on 10,000 videos of a gray WidowX200 5-DOF robot performing random exploration. Then, we evaluate the model on various modifications of the WidowX. First, we change the color of the entire WidowX from gray to red. Next, we extend the forearm link of the WidowX by 10cm, similar to the real-world modified WidowX experiment in Section 4.1. Finally, we evaluate on a WidowX with a link that is both longer and red. As seen in Table 5, the robot-aware model outperforms the vanilla model on world PSNR and SSIM metrics.

For the color change, the vanilla model predicts a still image for all timesteps, which suggests that the network does not recognize the red robot as the training time robot. Our model is invariant to the color shift of the robot due to the proxy robot procedure, and can accurately predict the trajectory with little degradation in quality.

With the link length change, the vanilla model is able to predict robot movement, but it replaces the longer link with the original short link. In some cases, the longer link allows the robot to contact the object and move it. Our model correctly predicts the resulting object movement, but the vanilla model fails to model the interaction, since object contact is impossible with the shorter link it predicts.

Finally, in the longer and different color link setting, the vanilla model is able to recognize and predict movement for the unaltered parts of the robot. It replaces the long red link with the original short gray link, and leaves the long red link in the image as an artifact. Similar to the previous experiment, the vanilla model has degraded object prediction while our model can still accurately predict the dynamics.

              Color Change    Link Change    Color + Link Change
              PSNR   SSIM     PSNR   SSIM    PSNR   SSIM
VF            37.5   0.976    42.3   0.994   40.3   0.987
RA (Ours)     38.7   0.985    45.2   0.997   40.9   0.991
Table 5: Zero-shot video prediction results (world PSNR / SSIM) on a modified WidowX200 in simulation. The modified WidowX200 has a different color and/or link length.

A.4 Visual dynamics experiment details

We now present some details of the visual dynamics experiments in Section 4.1 where we evaluated models on unseen robots as seen in Figure 4.

Training and finetuning details. Models were pretrained for 150,000 gradient steps on the RoboNet dataset with the Adam optimizer and a batch size of 16. We use scheduled sampling to change the prediction loss from 1-step to 5-step future prediction over the course of training. For fine-tuning, all models were trained on the fine-tuning dataset for 10,000 gradient steps with a batch size of 10 and scheduled sampling.

Evaluation details. To evaluate PSNR and SSIM metrics over the world region of the images instead of the entire image, we preprocess the images by setting the robot region to black. This removes all pixel differences between the predicted and target robot regions before computing the PSNR and SSIM over the entire image.

For all experiments, we evaluate on sequences of length 5 and report the average world PSNR and SSIM metrics over the timesteps. Because the models are stochastic, we perform best-of-3 evaluation: 3 videos are sampled from the model, and we report the sample with the best PSNR and SSIM.
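The world-region PSNR protocol above can be sketched as follows, assuming images in [0, 1]; the function name and the infinite-PSNR guard are illustrative choices:

```python
import numpy as np

def world_psnr(pred, target, mask, max_val=1.0):
    """PSNR computed with the robot region blacked out in both images.

    pred, target: (3, H, W) float arrays in [0, max_val]
    mask:         (1, H, W) boolean array, True on robot pixels
    """
    robot_region = np.repeat(mask, 3, axis=0)
    p, t = pred.copy(), target.copy()
    # Blacking out both robot regions removes any pixel difference
    # there, so only world-region errors contribute to the MSE.
    p[robot_region] = 0.0
    t[robot_region] = 0.0
    mse = np.mean((p - t) ** 2)
    if mse == 0:
        return np.inf  # identical world regions
    return 20 * np.log10(max_val) - 10 * np.log10(mse)
```

Note that the MSE is still averaged over all pixels, matching the protocol of removing robot-region differences and then computing the metric over the entire image.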

Dataset. Following RoboNet [Dasari et al., 2019] conventions, the image dimension is 64 by 48 pixels, and the action space of the robot is the displacement in end-effector pose, whose components are the Cartesian coordinates of the gripper with respect to the robot base, the yaw of the gripper, and the gripper force. The videos are of length 31, but we sample subsequences of length 6 from each video to train the network to predict 5 images from 1 conditioning image. Because the workspace bounds vary in size across robots, RoboNet normalizes the Cartesian coordinates to [0, 1] using the minimum and maximum boundary of each workspace.
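The workspace normalization can be sketched as below; the function name and bounds are placeholders, not RoboNet's actual values:

```python
import numpy as np

def normalize_state(xyz, ws_min, ws_max):
    """Map Cartesian gripper coordinates into [0, 1] using workspace bounds."""
    xyz = np.asarray(xyz, dtype=float)
    # Standard min-max normalization per coordinate.
    return (xyz - ws_min) / (ws_max - ws_min)
```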

The WidowX200 dataset consists of 1800 videos collected using a random Gaussian action policy that outputs end-effector displacements. For the pushing task, we fix the remaining action dimensions to constants and sample planar displacements from a 2-dimensional Gaussian with zero mean (standard deviations in centimeters). Similar to RoboNet, the states are normalized by the workspace minimum and maximum. We also collected 25 trajectories of the modified WidowX200, as seen in Figure 4, for evaluation.

Figure 9: Qualitative outputs of the zero-shot prediction experiments. From left to right, we show the ground truth start, ground truth future, and predictions for the robot-aware model and VFS model. The baseline blurs the bear in the top row, and predicts minimal movement for the octopus in the bottom row. The proxy robot is visualized in green. Refer to the videos on the website for more visual comparisons.

A.5 Control experiment details

Figure 10: The objects used in the control experiments. The few-shot WidowX250 transfer experiment used all objects, while the zero-shot Franka experiment used the bear and watermelon.

The bear, watermelon, box, octopus, and shark used in the control experiments are seen in Figure 10, and vary in size, texture, color, and deformability.

In the few-shot WidowX200 control experiment, the robot is tasked with pushes on five different objects that vary in shape, color, deformability, and size. We chose two push directions, forward and sideways, giving a total of 10 push tasks. We run 5 trials for each push task, for a total of 50 pushes per method. For the zero-shot Franka experiment, the controllers are evaluated on two different objects pushed in three different directions. Each pushing task is repeated 5 times, for a total of 30 trials per controller.

CEM action selection. As mentioned in Section 2, the CEM algorithm is used to search for action trajectories that minimize the given cost function. The CEM hyperparameters are constant across controllers to ensure that all methods get the same search budget for action selection. For the few-shot WidowX200 experiment, the CEM action selection generates 300 action trajectories of length 5, selects the top 10 sequences, and optimizes the distribution for 10 iterations. For the zero-shot Franka experiment, the CEM action selection generates 100 action trajectories of length 5, selects the top 10 sequences, and optimizes the distribution for 3 iterations.
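The CEM loop described above (sample trajectories, keep the lowest-cost elites, refit the sampling distribution) can be sketched as follows; the toy quadratic cost in the test stands in for the planning cost computed from model rollouts, and all names are illustrative:

```python
import numpy as np

def cem_plan(cost_fn, horizon, act_dim,
             n_samples=300, n_elites=10, n_iters=10, seed=0):
    """Cross-entropy method over open-loop action trajectories.

    cost_fn maps an (horizon, act_dim) action sequence to a scalar cost.
    Returns the mean of the final elite distribution.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        # Sample candidate trajectories from the current Gaussian.
        samples = rng.normal(mean, std, size=(n_samples, horizon, act_dim))
        costs = np.array([cost_fn(s) for s in samples])
        # Keep the lowest-cost elites and refit the sampling distribution.
        elites = samples[np.argsort(costs)[:n_elites]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```

With the paper's few-shot settings (300 samples, 10 elites, 10 iterations) the Gaussian concentrates quickly on low-cost trajectories.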

A.6 Robot masks and calibration

Acquiring robot masks that accurately segment the robot from the world is crucial for the robot-aware method, which depends on the mask to train the dynamics model and evaluate costs, as mentioned in Sections 3.1 and 3.2. The first step in acquiring robot masks is to obtain the camera calibration, i.e., the intrinsic and extrinsic matrices of the camera. If we have physical access to the robot and camera, acquiring the camera calibration is trivial. In our experimental setup with the Franka and WidowX200, we use AprilTag to calibrate the camera extrinsics, i.e., the transformation between camera coordinates and robot coordinates, given the camera intrinsics.

Extracting calibration from RoboNet. However, if we do not have access to the robot and camera, as in RoboNet, acquiring camera calibration is still possible. RoboNet does not contain the camera extrinsics information, but does contain the camera model information. Therefore, we use the default factory-calibrated camera intrinsics for each corresponding camera model.

Next, RoboNet contains the 3D positions of the robot end-effector in robot coordinates for each image. For each viewpoint, we hand-annotate a few dozen 2D image coordinates of the corresponding end-effector, and use OpenCV's camera calibration functionality to regress the camera extrinsics given the camera intrinsics and the labeled 3D-2D end-effector point pairs.

Synthesizing masks. Once we have the camera calibration of the robot, we render the 3D model of the robot, and project it to a 2D segmentation map using the camera calibration. We use the MuJoCo simulator for this process. Conveniently, the MuJoCo simulator can render a segmentation map of the geometries for a given camera viewpoint. By setting up an empty MuJoCo scene with only the robot geometries, we can render the geometry segmentation map and use that as the robot mask.

Using depth. Observe that the robot mask above is computed from the robot state alone, without any reference to the world state. As a result, it cannot account for occlusion of the robot by objects in the scene. For example, if the robot is pushing a large object toward the camera, the part of the object occluding the robot will be counted as robot region. If we assume access to RGB-D observations, however, it is easy to refine the mask to remove occluded regions. To do so, we compute the distance of all robot pixels through the above projection, and zero out pixels in the mask where this computed distance is greater than the observed depth at that pixel. In other words, pixels whose observed depth is smaller than the computed distance of the robot at that location must correspond to occluders. In practice, in our experiments, we find it sufficient to use RGB cameras and ignore occlusions.
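The depth-based refinement can be sketched as: drop a robot-mask pixel whenever the observed surface is closer to the camera than the robot's computed distance there. The function name and noise tolerance are assumptions for illustration:

```python
import numpy as np

def refine_mask_with_depth(mask, robot_depth, observed_depth, tol=0.01):
    """Remove occluded pixels from the robot mask.

    mask:           (H, W) boolean robot mask from projection
    robot_depth:    (H, W) computed distance of the robot at each pixel
    observed_depth: (H, W) measured depth from the RGB-D camera
    """
    # If the observed surface is closer than the robot would be,
    # something occludes the robot there: drop it from the mask.
    occluded = observed_depth < robot_depth - tol
    return mask & ~occluded
```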